Virtualization Mechanisms for Mobility, Security and System Administration

Shaya Potter

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2010

© 2010 Shaya Potter
All Rights Reserved

ABSTRACT

Virtualization Mechanisms for Mobility, Security and System Administration
Shaya Potter

This dissertation demonstrates that operating system virtualization is an effective method for solving many different types of computing problems. We have designed novel systems that make use of commodity software while solving problems that were not conceived when the software was originally written. We show that by leveraging and extending existing virtualization techniques, and introducing new ones, we can build these novel systems without requiring the applications or operating systems to be rewritten.

We introduce six architectures that leverage operating system virtualization. *Pod creates fully secure virtual environments and improves user mobility. AutoPod reduces the downtime needed to apply kernel patches and perform system maintenance. PeaPod creates least-privilege systems by introducing the pea abstraction. Strata improves the ability of administrators to manage large numbers of machines by introducing the Virtual Layered File System. Apiary builds upon Strata to create a new form of desktop security by using isolated persistent and ephemeral application containers. Finally, ISE-T applies the two-person control model to system administration.

By leveraging operating system virtualization, we have built these architectures on Linux without requiring any changes to the underlying kernel or user-space applications. Our results, with real applications, demonstrate that operating system virtualization has minimal overhead. These architectures solve problems with minimal impact on end-users while providing functionality that would previously have required modifications to the underlying system.

Contents

List of Figures
List of Tables
Acknowledgments
1 Introduction
1.1 OS Virtualization Security and User Mobility
1.2 Mobility to Improve Administration
1.3 Isolating Cooperating Processes
1.4 Managing Large Numbers of Machines
1.5 A Desktop of Isolated Applications
1.6 Two-Person Control Administration
1.7 Technical Contributions
2 Overview of Operating System Virtualization
2.1 Operating System Kernel Virtualization
2.2 File System Virtualization
2.3 Related Work
3 *Pod: Improving User Mobility
3.1 *Pod Architecture
3.1.1 Secure Operating System Virtualization
3.2 Using a *Pod Device
3.3 Experimental Results
3.4 Related Work
4 AutoPod: Reducing Downtime for System Maintenance
4.1 AutoPod Architecture
4.2 Migration Across Different Kernels
4.3 Autonomic System Status Service
4.4 AutoPod Examples
4.4.1 System Services
4.4.2 Desktop Computing
4.4.3 Setting Up and Using AutoPod
4.5 Experimental Results
4.6 Related Work
5 PeaPod: Isolating Cooperating Processes
5.1 PeaPod Model
5.2 PeaPod Virtualization
5.2.1 Pea Virtualization
5.2.2 Pea Configuration Rules
5.2.2.1 File System
5.2.2.2 Transition Rules
5.2.2.3 Networking Rules
5.2.2.4 Shared Namespace Rules
5.2.2.5 Managing Rules
5.3 Security Analysis
5.4 Usage Examples
5.4.1 Email Delivery
5.4.2 Web Content Delivery
5.4.3 Desktop Computing
5.5 Experimental Results
5.6 Related Work
6 Strata: Managing Large Numbers of Machines
6.1 Strata Basics
6.2 Strata Usage Model
6.2.1 Creating Layers and Repositories
6.2.2 Creating Appliance Templates
6.2.3 Provisioning and Running Appliance Instances
6.2.4 Updating Appliances
6.2.5 Improving Security
6.3 Virtual Layered File System
6.3.1 Layers
6.3.2 Dependencies
6.3.2.1 Dependency Example
6.3.2.2 Resolving Dependencies
6.3.3 Layer Creation
6.3.4 Layer Repositories
6.3.5 VLFS Composition
6.4 Improving Appliance Security
6.5 Experimental Results
6.5.1 Reducing Provisioning Times
6.5.2 Reducing Update Times
6.5.3 Reducing Storage Costs
6.5.4 Virtualization Overhead
6.6 Related Work
7 Apiary: A Desktop of Isolated Applications
7.1 Apiary Usage Model
7.2 Apiary Architecture
7.2.1 Process Container
7.2.2 Display
7.2.3 File System
7.2.4 Inter-Application Integration
7.3 Experimental Results
7.3.1 Handling Exploits
7.3.1.1 Malicious Files
7.3.1.2 Malicious Plugins
7.3.2 Usage Study
7.3.3 Performance Measurements
7.3.3.1 Application Performance
7.3.3.2 Container Creation
7.3.4 File System Efficiency
7.3.5 File System Virtualization Overhead
7.4 Related Work
8 ISE-T: Two-Person Control Administration
8.1 Usage Model
8.2 ISE-T Architecture
8.2.1 Isolation Containers
8.2.2 ISE-T's File System
8.2.3 ISE-T System Service
8.3 ISE-T for Auditing
8.4 Experimental Results
8.4.1 Software Installation
8.4.2 System Services
8.4.3 Configuration Changes
8.4.4 Exploit
8.5 Related Work
9 Conclusions and Future Work
Bibliography
A Restricted System Calls
A.1 Host-Only System Calls
A.2 Root-Squashed System Calls
A.3 Option-Checked System Calls
A.4 Per-Virtual-Environment System Calls

List of Figures

3.1 *Pod Virtualization Overhead
3.2 *Pod Checkpoint/Restart vs. Normal Startup Latency
4.1 AutoPod Model
5.1 PeaPod Model
5.2 Example of Read/Write Rules
5.3 Protecting a Device
5.4 Directory-Default Rule
5.5 Transition Rules
5.6 Networking Rules
5.7 Namespace Access Rules
5.8 Compiler Rules
5.9 Set of Multiple Rule Files
5.10 Email Delivery Configuration
5.11 Web Delivery Rules
5.12 Desktop Application Rules
5.13 PeaPod Virtualization Overhead
6.1 How Layers, Repositories, and VLFSs Fit Together
6.2 Layer Definition for MySQL Server
6.3 Layer Definition for Provisioned Appliance
6.4 Metadata for MySQL Server Layer
6.5 Metadata Specification
6.6 Storage Overhead
6.7 Postmark Overhead in Multiple VAs
6.8 Kernel Build Overhead in Multiple VAs
6.9 Apache Overhead in Multiple VAs
7.1 Apiary Screenshot
7.2 Usage Study Task Times
7.3 Application Performance with 25 Containers
7.4 Application Startup Time
7.5 Postmark Overhead in Apiary
7.6 Kernel Build Overhead in Apiary
8.1 ISE-T Usage Model

List of Tables

3.1 Per-Device *Pod File System Sizes
3.2 Benchmark Descriptions
3.3 *Pod Checkpoint Sizes
4.1 Application Scenarios
4.2 AutoPod Migration Costs
5.1 Application Benchmarks
6.1 VA Provisioning Times
6.2 VA Update Times
6.3 Layer Repository vs. Static VAs
7.1 Application Benchmarks
7.2 File System Instantiating Times
7.3 Apiary's VLFS Layer Storage Breakdown
7.4 Comparing Apiary's Storage Requirements Against a Regular Desktop
7.5 Update Times for Apiary's VLFSs
8.1 ISE-T Commands
8.2 Administration Tasks

Acknowledgments

My deepest thanks go to my advisor, Jason Nieh, for his continual support and guidance. His constant questioning, demands for explanations, and objective evaluation have helped develop ideas that I would not have been able to reach on my own, while also teaching me skills that I hope remain with me.
I am constantly amazed by how many different studies, projects and papers he is able to juggle while retaining the ability to ask insightful questions. He has provided the model to which I aspire.

There are many people at Columbia who have been a significant part of my graduate experience. My officemates, Dinesh Subhraveti, Dan Phung and Dana Glasner, have been good friends, acted as sounding boards, provided valuable feedback, and, in general, made the graduate experience an enjoyable one. I've worked on many projects together with Ricardo Baratto and Oren Laadan and I am always amazed by their abilities. Stelios Sidiroglou-Douskos, Mike Locasto, Carlo Pérez and Gong Su provided valuable feedback and friendship. I'd also like to thank Angelos Keromytis and Steven M. Bellovin for providing help and guidance in my research. In addition, I'd like to thank Erez Zadok, Gail Kaiser and Chandra Narayanaswami for serving on my Ph.D. committee. I'd be remiss if I did not thank the administrative staff in the Computer Science Department, including Alice Cueba, Twinkle Edwards, Elias Tesfaye and Susan Tritto, for handling many tasks that enabled me to focus on my research.

Finally, I'd like to thank my parents, whose constant support and belief in me has enabled all my accomplishments.

Dedicated in memory of my grandmothers,
יוכבד בת צבי הירש לייב and אלתע מאשא בת חיים יצחק
They were proud of all my accomplishments and were always looking forward to the day when my Ph.D. would be complete. Their memory will be with me always.

Chapter 1
Introduction

Computer use is more widespread today than it was even 10 years ago, but we are still using software designs from 20 or 30 years ago. Although these designs are well tested and understood, they were created to solve the problems of that time. Today's users face difficulties that the original software designers did not imagine. We can redesign the operating system and applications to attempt to address these problems, but this creates new, relatively untested software and designs and may force users and administrators to learn fundamentally new models of usage.

This dissertation demonstrates that many problems can be solved not by redesigning and rewriting the applications, but instead by virtualizing the interfaces through which existing applications interact with the operating system. Virtualization is the creation of a layer of indirection between two entities that previously communicated directly. For example, in hardware virtualization [28, 34, 142, 147], a virtual machine monitor (VMM) places a layer of indirection between an operating system and the underlying hardware. A VMM provides a complete virtualized hardware platform for an operating system, enabling any operating system supporting that platform to run as though on physical hardware. Hardware virtualization has been shown to enable operating systems to take advantage of hardware for which they were not designed. The Disco project [34] demonstrated how to run an operating system not designed for ccNUMA architectures on those architectures by using a VMM.

Operating systems can also be virtualized in multiple ways, most commonly by providing each process with its own virtualized and protected memory mappings.
Instead of letting a process directly access the machine's memory, the operating system, with hardware support, places a layer of indirection between the processes and physical memory, creating a virtualized mapping between the process's memory space and the physical machine's memory space. This provides security, efficiency and flexibility. The processes' memories are isolated from one another, but memory can still be shared among processes.

Memory, however, is not the only operating system interface that can be virtualized. Zap [100] and FiST [152] demonstrated that an operating system's kernel state and file systems can be virtualized as well. Kernel virtualization operates by virtualizing the system call interface, that is, by placing a layer of indirection between processes and the system calls they use to access the operating system kernel's functionality and ephemeral state. Similarly, file system virtualization works by placing a layer of indirection between processes and the underlying physical file systems, the operating system's persistent state. Instead of accessing the machine's kernel and file system directly using built-in system call and file system functions, the application running in the virtualized operating system executes a function within the virtualization layer. The virtualization layer can modify the parameters passed to it, perform work required by the desired virtualization, call built-in kernel and file system functions to perform the desired real work, and modify the return value passed to the calling process.

This dissertation demonstrates that by leveraging different forms of operating system virtualization, we can use commodity operating systems and software in novel ways and solve problems that the original developers could not have anticipated. By virtualizing the interfaces, we do not change the applications or operating system, but instead create specialized environments that enable us to solve problems. Although virtualized environments, from the perspective of processes, look and behave like the system they are virtualizing, they can look and behave very differently to the systems on which they are hosted. This decoupling of execution environment and host environment lets us create tools that run on the host and solve new problems without modifying well-tested operating system and application code. For example, we can create virtual private namespaces for applications distinct from the namespace of the physical computer. To the processes running within the virtualized environment, the environment looks like a regular machine, provides the same application interface, and does not require applications to be rewritten. Similarly, because operating system virtualization only interposes itself between the application and the underlying operating system kernel, the underlying kernel's binary and source code do not have to be modified either.

1.1 OS Virtualization Security and User Mobility

Some forms of operating system virtualization [85, 100] are limited to isolating a single user's processes and are not designed to provide any security constraints. This is especially noticeable for processes that run with elevated privileges, such as those provided to root on Unix systems. Without secure virtualization, operating system virtualization can only solve single-user problems, substantially limiting its use.
To enable secure virtualization, we have enabled each virtualized environment to have a unique set of virtualized users. Virtualizing the set of users gives each environment an isolated set of privileges. However, unlike hardware virtualization, where each virtual machine has a full operating system instance and therefore its own isolated privileged state, an operating system generally has only a single set of privileged state. Therefore, in addition to providing unique sets of virtualized users, we also restrict the abilities of virtualized root users. If the virtualized root users were not restricted, they could be treated equivalently to the root user of the underlying system, enabling them to break the virtualization abstraction. This dissertation demonstrates how operating system virtualization can be used to simply virtualize the set of users while restricting the abilities of the privileged but virtualized root user.

We then show that operating system virtualization can be combined with checkpoint/restart functionality to improve mobile users' computing experience. Many users lug around bulky, heavy computers simply to have access to their data and applications. To solve this problem, we created *Pod devices. A *Pod is a physical storage device, such as a portable hard disk or USB thumb drive, containing a complete application-specific environment, such as a desktop or web environment. *Pod devices run their applications on whatever host computer is available at the user's current location. By storing the entire environment on the portable device, users can move it between computers while retaining a common usage environment. Operating system virtualization, coupled with process migration technology, enables users to move their running processes and data between physical machines, much like a laptop can be suspended and resumed when changing locations. We have built a number of *Pod devices that enable users to carry an application [109, 110, 113] or an entire desktop [114] with them.

1.2 Mobility to Improve Administration

Building on *Pod, we demonstrate how operating system virtualization and checkpoint/restart ability can improve system maintenance, much of which requires taking the machine offline and shutting down all active processes. Among other problems, this prevents the kernel from being patched quickly, as applying a patch requires the machine to be rebooted for the patch to take effect, thereby killing all running processes on the machine. To address this, we developed AutoPod [112], a system that enables unscheduled operating system updates while preserving application service availability. AutoPod leverages *Pod's virtualization abstraction to provide a group of processes and associated users with an isolated, machine-independent virtualized environment decoupled from the underlying operating system instance. This enables AutoPod to run each independent service in its own isolated environment, preventing a security fault in one from propagating to other services running on the same machine. This virtualized environment is integrated with a checkpoint/restart system that allows processes to be suspended, resumed and migrated across operating system kernel versions with different security and maintenance patches.
AutoPod incorporates a system status service to determine when operating system patches need to be applied to the current host, then automatically migrates application services to another host to preserve their availability while the current host is updated and rebooted. AutoPod's ability to migrate processes across kernel versions also increases *Pod's value by making it possible for users to move their *Pod between machines that are not running the exact same kernel version.

1.3 Isolating Cooperating Processes

AutoPod envisions virtual computer usage growing rapidly as users create and use many task-specific virtual computers, as is already occurring with the rise of virtual appliances. But more computers mean more targets for malicious attackers, making it even more important to keep them secure. Operating system virtualization, as in a pod, provides namespaces that isolate processes from the host, enabling a level of least-privilege isolation as single services are constrained to independent pods. Today's services, however, are complex applications with many distinct components. Even within a pod, each component of the service has access to all resources required by every component within the system, which is not a true least-privilege system. To solve this problem, we developed PeaPod [115], which combines the pod with a pea (Protection and Encapsulation Abstraction). As AutoPod demonstrates, pods can be used to isolate services into separate virtual machine environments. The pea is used within a pod to provide finer-grained isolation among application components of a single service while still enabling them to interact. This allows services composed of multiple distinct processes to be constructed more securely. PeaPod enables processes to work together while limiting the resources each process can access to only those needed to perform its job.

1.4 Managing Large Numbers of Machines

Although virtualization provides numerous benefits, such as minimizing the amount of hardware to maintain by putting multiple virtual machines on a single physical host, it can also make it harder for administrators to maintain an increased number of virtual machines. Just as the proliferation of virtual machines affects security, it also significantly increases the administrative burden. Instead of managing a single machine providing a number of services, one manages many independent virtual machines that each provide a single service. When security holes are discovered in core operating system functionality, each virtual machine must be fixed separately.

This dissertation shows that operating system virtualization improves management of large systems. Although virtualization decreases the amount of physical hardware to manage, it does not reduce, and can even increase, the number of machine instances to be managed. Strata improves this situation by introducing the Virtual Layered File System (VLFS). Instead of having independent file systems for each service, the VLFS enables a file system to be divided into a set of shareable layers and combined into a single file system namespace view. This enables many machines to be stored efficiently, because data that is common to more than one machine has to be stored only once. It allows efficient provisioning, because none of the shared files have to be copied into place.
Finally, it improves maintenance, because a patched layer only has to be installed once and is then pulled into all the VLFSs that use that layer.

1.5 A Desktop of Isolated Applications

Once we can manage multiple independent machines efficiently, we can use those machines in novel ways. For instance, Apiary improves the ability to create secure computer desktops. Apiary leverages Strata's VLFS to contain each application in an independent and isolated container. Even if one application is exploited, the exploit will be confined to that application and the rest of the user's data will remain secure. Similarly, because VLFSs allow very quick provisioning, Apiary can run desktop applications ephemerally in addition to their regular persistent execution models. An ephemeral application is an application whose container is provisioned anew for each execution of the application. Once the execution is complete, the container is removed from the system. This means that even if an application executed ephemerally is exploited, the exploit will not persist, because the next ephemeral execution will be within a fresh container. Finally, because independent applications do not provide the integrated feel users expect from their desktops, Apiary enables applications to integrate securely at specific points. Apiary improves on PeaPod for desktop scenarios by enabling applications to be securely isolated without requiring complicated access rules to be designed and written.

1.6 Two-Person Control Administration

Finally, we have leveraged operating system virtualization to provide high-assurance system administration. In a traditional operating system, the administrative user is an all-powerful entity who can perform any task with no record of the changes made on the system and no check on their power. This causes two problems. First, an administrator with malicious intent can subvert the security of the system. Second, changes made by a single user are prone to error. ISE-T [111] changes this model by applying the concept of two-person control to system administration. Two-person control changes system administration in two ways. First, instead of performing administrative actions directly on the machine, the changes are first performed on a sandbox that mirrors the machine being administered. By providing two administrators with their own sandboxes in which to perform the same administrative task, ISE-T can extract their changes, compare them for equivalence, and, if equivalent, commit them to the underlying machine. Second, in cases where the two-person control system is too expensive, ISE-T can extract the changed state and store it in a secure audit log for future verification before committing it to the underlying system. This enables a high-assurance system with little additional administration cost.

1.7 Technical Contributions

This dissertation contributes multiple technical innovations and their associated architectures:

1. We introduce an operating system virtualization platform that provides secure virtual machines without any underlying operating system changes. This is necessary to enable multiple virtual environments to run in parallel on a single machine as well as to enable secure execution of untrusted processes.
2. We introduce a portable storage-based computing environment. By combining our secure operating system virtualization platform, a checkpoint/restart system, and portable storage devices, we created the *Pod architecture to migrate a user's processes between machines securely.

3. We introduce a checkpoint/restart mechanism to enable the migration of processes between machines running different kernels. This is accomplished by saving the checkpoint/restart state in a kernel-independent format so that it can be adapted to the internal data structures of the kernel to which the processes are being migrated. The AutoPod architecture improves system management by allowing administrators to administer machines without terminating processes. It also improves the utility of *Pod by not limiting users to machines running the same kernel version.

4. We introduce the pea process isolation abstraction. Peas allow individual processes in a multi-process system to cooperate while contained in individual resource-restricted compartments. The PeaPod architecture creates least-privilege environments for the multiple processes that constitute services in use today.

5. We introduce the Virtual Layered File System (VLFS). The VLFS improves system administration by enabling system administrators to divide a file system into distinct subset layers and use the layers for multiple simultaneous installations. The VLFS combines traditional package management with unioning file systems in a new way, yielding powerful new functionality. The Strata architecture permits administrators to provision and manage large numbers of virtual machines efficiently.

6. We introduce the concepts of a containerized desktop and ephemeral application execution. In a containerized desktop, each desktop application is fully isolated in its own container with its own file system. This prevents an exploited application from accessing data belonging to other applications. Ephemeral application execution creates a single-use application container and file system for each individual application execution. Ephemeral containers prevent malicious data from having any persistent effect on the system and isolate faults to a single application instance. The Apiary architecture provides a new way to secure desktop applications by isolating each application within its own container, while letting the isolated applications interact in a secure manner through ephemeral execution.

7. We introduce two-person control for system administration to create a high-assurance form of system administration. This helps keep system administration faults from impacting a system. We use the same mechanism to introduce auditable system administration, increasing assurance with little additional cost. The ISE-T architecture enables systems to be administered within this two-person control model.

Chapter 2
Overview of Operating System Virtualization

To understand how operating system virtualization allows us to solve new software problems without requiring the software to be rewritten, we first explain what operating system virtualization is and how it works. Many people are familiar with hardware virtualization, where real operating systems run on virtual hardware provided by a virtualization layer between the host machine and the operating system. Operating system virtualization differs from hardware virtualization in where it places the virtualization layer.
Instead of virtualizing the hardware interfaces, it virtualizes the operating system interfaces to provide virtualized views of the underlying host operating system. Unlike hardware virtualization, where different operating systems can run in parallel, operating system virtualization is restricted to the same operating system as the host. This dissertation explores the benefits of virtualizing the two primary operating system elements that applications leverage: the kernel, which provides the runtime but ephemeral state of a process, and the file system, which provides the long-term stable storage on which processes depend.

2.1 Operating System Kernel Virtualization

Applications depend heavily on kernel state during their runtime, from simple things like process identifiers to more complicated state like inter-process communication (IPC) keys, file descriptors and memory mappings. Some of this state already has an element of virtualization that enables multiple processes to coexist on a single system. For example, each process has its own file descriptor and virtual memory namespaces. On the other hand, state such as process identifiers and IPC keys is shared within a single namespace accessible to all processes. One primary use of operating system kernel virtualization is to create multiple parallel namespaces that are fully isolated from one another [7, 74, 116]. But this requires significant in-kernel modifications.

Operating system kernel virtualization is commonly implemented by virtualizing resource identifiers. Every resource that a virtualized process accesses has a virtual identifier which corresponds to a physical operating system resource identifier. When an operating system resource is created for a virtualized process, such as with IPC key creation, the virtualization layer, instead of returning the corresponding physical name to the process, intercepts the physical name value and returns a virtual name to the process. Similarly, any time a virtualized process passes a virtual identifier to the operating system, the virtualization layer intercepts it, replacing it with the appropriate physical identifier.

This type of operating system kernel virtualization is easily implemented by system call interposition. System call interposition can create a virtualized namespace because all operating system resources are accessed through system calls. By interposing on a system call, the virtualization abstraction can intercept the virtual resource identifier the process passes in with the system call and, if valid, replace it with the correct physical resource identifier. Similarly, whenever a physical resource is created and has its identifier passed back to a process, the virtualization abstraction can intercept the value and replace it with a newly created and mapped virtual identifier. By virtualizing a process so that it can only access virtually named resources, operating system virtualization decouples a process's execution from the underlying namespace of the host machine. Many commodity operating systems, including Solaris [116] and Linux [6], now include this functionality natively.
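To make this translation step concrete, the following user-space sketch keeps a per-environment table that maps virtual process identifiers to the host's physical ones and shows what an interposed kill() would do with it. The pod structure, table layout and helper names here are hypothetical illustrations of the bookkeeping, not the actual *Pod or Zap kernel code; in a real interposition layer the same mapping would be applied inside the wrapped system call rather than in an ordinary program.

    #include <stdio.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define POD_MAX 16

    /* one mapping table per pod: virtual PID n maps to phys[n - 1] */
    struct pod {
        pid_t phys[POD_MAX];
        int used;
    };

    /* physical -> virtual: called when the host hands a new PID back to the pod */
    static pid_t pod_map_new(struct pod *p, pid_t ppid)
    {
        if (p->used == POD_MAX)
            return -1;
        p->phys[p->used++] = ppid;
        return p->used;                 /* the newly assigned virtual PID */
    }

    /* virtual -> physical: called when a pod process passes a PID into the kernel */
    static pid_t pod_to_phys(struct pod *p, pid_t vpid)
    {
        if (vpid < 1 || vpid > p->used)
            return -1;                  /* unknown inside this pod */
        return p->phys[vpid - 1];
    }

    /* what an interposed kill() would do: translate, then call the built-in call */
    static int pod_kill(struct pod *p, pid_t vpid, int sig)
    {
        pid_t ppid = pod_to_phys(p, vpid);
        if (ppid < 0)
            return -1;                  /* would be -ESRCH in the kernel */
        return kill(ppid, sig);
    }

    int main(void)
    {
        struct pod p = { .used = 0 };
        pid_t vself = pod_map_new(&p, getpid());
        printf("host PID %d appears as virtual PID %d inside the pod\n",
               (int)getpid(), (int)vself);
        /* signal 0 performs an existence check through the mapping */
        return pod_kill(&p, vself, 0);
    }

In an actual interposition layer, pod_kill would stand in for the kernel's kill entry point for processes inside the pod, so the translation remains invisible to the application.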
Kernel virtualization is not limited to creating independent and isolated namespaces, but can also change how the kernel behaves. Instead of simply translating resource identifiers, kernel virtualization can change how system calls interact with those identifiers. For instance, it can change the security semantics of system calls. Many system calls have built-in security checks to decide whether a process has permission to execute a specific piece of functionality. Once the kernel is virtualized through the system call interface, the virtualized system calls can allow a process to access a resource it would otherwise have been prevented from accessing, or vice versa.

2.2 File System Virtualization

Kernel virtualization and system call interposition enable virtualization of the ephemeral kernel state, but each process also uses the file system, which provides processes with persistent storage. By virtualizing the file system, we enable processes to have a file system view that is independent of the host machine's file system. For instance, when creating multiple parallel kernel namespaces, one often intends to provide virtual machine environments. To do this, one must also provide a private file system namespace for each environment. If private file system namespaces are inadvertently omitted, the file system is shared and isolation is severely weakened.

In fact, commodity operating systems offer the ability to virtualize the file system in exactly this way, such as by leveraging the chroot facility, which enables a process to be confined to a subset of the underlying machine's file system. But because current commodity operating systems are not built to support multiple namespaces, we must address the security issues this causes. Although chroot can provide processes within a pod a virtualized file system namespace, there are many ways to break out of the standard chrooted environment, especially if one allows the chroot system call to be used by processes within the virtualized file system environment [58]. To provide secure file system virtualization, the virtualization mechanism must enforce the chrooted environment's limitations at all times. We have implemented a barrier directory in the underlying file system that prevents processes within a pod from crossing it. Even if a process is able to break the chrooted virtualized file system view, the process will never be able to access any files outside the virtualized area.

To enforce a barrier, we interpose on the file system's ->permission method, which determines whether a process can access a file or directory. For example, if a process tries to access a file a few directories below the current directory, the permission function is called on each directory in order, as well as on the file itself. If a call determines that the process does not have permission on that directory, the chain of calls ends, because the process must have permission to traverse the directory hierarchy in order to access the file. By interposing on the permission function, we can deny processes within a pod permission to access the barrier directory. Such a process cannot traverse the barrier and so cannot access any file outside the virtualized file system environment.

However, file system virtualization is not limited to the creation of private file system namespaces. Much as the barrier directory is implemented by interposing on the file system's permission function, one can also interpose on all the functionality the file system exposes to the operating system in order to create virtualized file system instances. Just as pod virtualization allows differentiating virtual machine environments without unique machine or operating system instances, file system virtualization permits differentiating each pod's file system namespace in unique ways without requiring each pod to have a unique physical file system. For instance, file system virtualization enables pods to have unique file system security policies. It can even create file system views totally independent of the underlying file system by combining multiple individual file systems into a single view.

In fact, this is exactly how stackable file systems [124, 152] work. Stackable file systems provide a completely virtual file system by interposing on the kernel's file system operations. Instead of interposing directly, as with system call virtualization, stackable file systems create a virtual file system that the kernel uses as a regular file system. But rather than having data stores on block devices of their own, stackable file systems leverage the data stored within other file systems. This enables them to interpose directly on the physical file system by leveraging the operating system's file system interface. Instead of executing the physical file system's functions directly, including using its directory entry and inode structures, the stackable file system interposes on those functions and provides its own set of file system structures that map onto those of the underlying physical file system. By interposing between the kernel and physical file systems, stackable file systems allow easy creation of virtual file systems. The virtual file system is then able to modify operations as appropriate for the needs of the system. For example, a unioning semantic can be implemented with a stackable file system that combines multiple underlying physical directories into a single view by interposing on the ->readdir method. Whenever a program calls that operation, the stackable file system creates a virtualized view by running the operation against all the underlying directories being unioned into a single view and returning the unioned set of data.
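The small program below is a user-space analogue of such an interposed ->readdir: it lists several underlying directories as one view, letting entries from earlier directories occlude identically named entries from later ones. It only illustrates the unioning semantic; it is not code from an actual stackable file system, and the merging policy shown is one reasonable choice among several.

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_ENTRIES 4096
    #define NAME_LEN    256

    /* returns 1 if an earlier (higher-priority) directory already supplied this name */
    static int already_listed(char names[][NAME_LEN], int count, const char *name)
    {
        for (int i = 0; i < count; i++)
            if (strcmp(names[i], name) == 0)
                return 1;
        return 0;
    }

    int main(int argc, char **argv)
    {
        static char names[MAX_ENTRIES][NAME_LEN];
        int count = 0;

        /* each argument names one underlying directory; earlier ones take precedence */
        for (int i = 1; i < argc; i++) {
            DIR *dir = opendir(argv[i]);
            if (dir == NULL)
                continue;
            struct dirent *ent;
            while ((ent = readdir(dir)) != NULL && count < MAX_ENTRIES) {
                if (already_listed(names, count, ent->d_name))
                    continue;          /* same name in a lower layer is occluded */
                snprintf(names[count], NAME_LEN, "%s", ent->d_name);
                count++;
                printf("%s\n", ent->d_name);   /* one entry in the unioned view */
            }
            closedir(dir);
        }
        return 0;
    }

Compiled and invoked as, say, ./union /layer/top /layer/base, it prints the merged listing a process would see through the unioned view.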
2.3 Related Work

Many different systems have been created to enable the virtualization of kernel state. They can be loosely grouped into four categories.

Operating system provided virtualization. This is most notable in operating systems that provide alternate namespaces for the creation of containers, including Solaris's Zones [116], Linux's Vserver [7] and Containers [6], and BSD's Jail mode [74]. Systems in this category are the least flexible, as their techniques are tightly coupled to the underlying system. This prevents them from being leveraged to solve problems for which they were not explicitly designed.

Direct interposition on system calls. This enables code to directly intercept the system call within the kernel. The kernel does not call the built-in system call's function, but instead executes the function provided by the virtualization layer, which in turn calls the built-in one if needed. This very old technique, common in MS-DOS, was used in Terminate and Stay Resident (TSR) programs [99]. In more modern usage, Zap [100] implements its virtualization by interposing directly on the set of system calls it desires to virtualize, as well as by providing a generic interface that enables other virtualization layers to interpose on whatever system calls they desire. The architectures in this dissertation use this approach.

Kernel-based system call trace and trapping. This is most notably provided by the ptrace system call [144], which provides tracing and debugging facilities that enable one process to completely control the execution of another. For instance, a controlling process can be notified whenever the controlled process attempts to execute a system call. Instead of letting the system call run directly, the controlling process chooses to allow or disallow the system call, to change the parameters being passed to the system call, or even to cause a totally separate code path to be executed. This is a very flexible approach because, while the interposition is enforced by the kernel via the ptrace system call, it runs as a regular user space program. However, due to the many context switches between the user space program using the ptrace system call and the kernel, performance suffers.

User space-based system call trace and trapping. Instead of trapping in the kernel, one can provide a user space library that supplies its own system call wrapper functions [1]. Well-behaved programs do not execute system calls directly, but call a library function that wraps the system call, enabling the system call to be virtualized by replacing that library function with one that enforces the virtualization of the kernel state. But this only works for well-behaved applications and cannot be used to enforce security schemes, as any application can execute system calls directly and avoid the library's interposition mechanism.

File system virtualization. Operating systems' file system interfaces have also been virtualized in multiple ways. Modern operating systems provide a Virtual File System (VFS) interface [73]. This enables different types of file systems to be used with the operating system in a manner transparent to all applications. In addition, modern operating systems support network file system shares using protocols such as NFS [135] and SMB [151]. These network file systems provide virtualized access to a remote file system while enabling applications to treat its contents as though they were stored locally.

A common way to create virtualized file system access is through stackable file systems. For example, Plan 9 [104] offered the 9P distributed file system protocol [105] to enable the creation of virtual file systems. HURD [35] and Spring [78] also included extensible file system interfaces. More commonly today, the NFS protocol serves as the basis for other file systems that virtualize and extend the Unix file system via the SFS Toolkit [89]. It exposes the NFS interface to user space programs, allowing them to provide file system functionality safely. But the NFS protocol is very complicated, and user space file systems that depend on it must fully understand it to be implemented correctly. The more usual approach is to leverage kernel functionality to create these virtualized file systems. This is generally easier to implement than an NFS-based approach because the kernel's file system interface is simpler than the one the NFS protocol exposes. It can be implemented via a user space file system framework such as FUSE [137] that provides the necessary kernel hooks. Alternatively, the entire file system can be built as an in-kernel file system that can be dynamically loaded and unloaded, as in FiST [152], which behaves as a native file system. In general, the in-kernel approach yields significantly better performance because fewer context switches are needed. Kernel-based virtualized file systems are known as stackable file systems and have been implemented in many different operating systems [97, 124, 130, 152].

Chapter 3
*Pod: Improving User Mobility

A key problem mobile users face is the lack of a common environment as they switch locations. The computer at the office is configured differently from the one at home, which is again different from the one at the library. Even though mobile users have massive computing power at each location, they cannot easily take advantage of it. These locations can have different sets of software installed, which can make it difficult for a user to complete a task. Moreover, mobile users want consistent access to their files, which is difficult to guarantee as they move around. The current personal computer framework ties a user's data to a single machine.

Laptops are a common attempt to solve the problems posed by mobility. Laptops enable users to carry their data and applications with them wherever they go. But laptops only mask the problem, as they do not leverage the existing infrastructure and suffer from a number of difficulties of their own. First, laptops are not as full-featured as desktop computers. They have less storage and smaller physical features like keyboards and displays. They are slower because cooling and space constraints prevent the fastest processors from being used in a laptop. Even laptops considered to be desktop replacements have speed limitations, tend to be as heavy as 8 or 9 pounds, and are not meant to be extremely mobile. Second, because laptops use small, specialized, and moving parts, they are more fault-prone. This manifests itself in moving parts like a fan or hard disk breaking down, or in an internal connection coming loose, as when memory is unseated from its socket.

To address these mobility and reliability problems posed by laptops, we have designed and built the *Pod architecture. *Pod leverages operating system virtualization to enable the creation of application-specific portable devices that decouple a user's application environment from any one physical machine. Depending on the mobile user's needs, the *Pod architecture lets users carry a single application or a large set of applications, as well as large sets of data.

For instance, many users do most of their computing work through a web browser. They read email through webmail interfaces, interact with friends on social networking websites, and even use word processors and spreadsheets without leaving the web browser. But while a web browser is available on every Internet-connected computer, it will not necessarily be configured according to their needs. For instance, helper applications, browser plugins, bookmarks and cookies will not move with them between machines. For these users, we leveraged the *Pod architecture to create a WebPod device that contains a web browser, plugins and the helper applications needed within the web environment.

Many mobile users, however, require a more full-featured computing environment. They do not want to store all their data on the Internet, nor to be limited to the applications available via the web. They expect the traditional desktop experience. Although they have access to powerful computers at many locations, these computers are not configured correctly for their work. For these users, we leveraged the *Pod architecture to create a DeskPod device containing all of the desktop applications a user requires, along with their data.

The *Pod architecture enables mobile users to obtain the same persistent, personalized computing experience at any computer. *Pod takes advantage of commodity storage devices that can easily fit in a user's pocket yet store large amounts of data. These devices range from flash memory sticks that can hold 64 GB of data to portable hard disks, such as an Apple iPod, that can hold 120 GB of data. These devices can hold a user's entire computing environment, including applications and all their data. The *Pod architecture allows a user to decouple their computing session from the underlying computer, so that it can be suspended to a portable storage device, carried around easily, and resumed from the storage device on a completely different computer. Users have ubiquitous access to computing power, at work, home, school, the library or even an Internet cafe, and the *Pod architecture enables them to continue working, even in the face of faulty components, simply by moving their *Pod-based environment to a new host machine. *Pod provides this functionality without modifying, recompiling or relinking any applications or the operating system kernel, and with only a negligible impact on performance.

The *Pod architecture does have limitations, as shown by our MediaPod and GamePod devices. These devices enable users to carry with them a multimedia player and a game playing environment, respectively. Although they are very flexible in what media formats and games they can play, they do not provide any computing capabilities of their own. Moreover, although they allow users to move their environment among computers, they do not let them make use of the environment on the go, when they have no access to a computer. This is in contrast to devices such as Apple's iPod and Nintendo's Game Boy and DS portable devices, which can only play a limited number of formats but provide their own computing ability and are therefore usable on the go, without a powerful computer. These devices are popular with users on the move, so MediaPod and GamePod are less likely to replace them.

3.1 *Pod Architecture

*Pod operates by encapsulating a user's computing session within a virtualized execution environment and storing all state associated with the session on the portable storage device. *Pod also leverages THINC [27] to virtualize the display so that the application session can be scaled to different display resolutions as a user moves among computers. This enables a computing session to run the same way on any host despite different operating system environments and display hardware. These virtualization mechanisms enable *Pod to isolate and protect the host from untrusted applications that a user may run as part of their *Pod session. *Pod virtualization also prevents other applications outside of the computing session that may be running on the host from accessing any of the session's data, protecting the user's privacy. We have combined *Pod's virtualization with Zap's checkpoint/restart mechanism [100], allowing users to suspend the entire computing session to the portable storage device so that it can be migrated between physical computers by simply moving the storage device to a new computer and resuming the session there.
*Pod preserves on the portable device the file system state and process execution state associated with the computing session. A limitation of this approach is that Zap only supports homogeneous migration, so it can migrate only between machines running the exact same kernel. In Chapter 4, we demonstrate heterogeneous migration, thereby removing this limitation. As a result, *Pod enables users to maintain a common environment no matter what computer they are using. Devices built upon the *Pod architecture are also less prone to problems, because they do not contain a complete operating system, only the programs needed for one specific application environment. Various operating system services that a normal machine depends on are not needed, so maintenance is simpler.

To the user, a *Pod-based device appears no different from a private computer, even though it runs on a host that may be running other applications. Those applications run outside the session provided by the *Pod device and are not visible to a user within the *Pod session. To provide strong security, the *Pod can store the session on an encrypted file system. If the *Pod device is lost or stolen, an attacker will only be able to use it as his own personal storage device.

3.1.1 Secure Operating System Virtualization

In order to enable a *Pod device to be used on computers that are not controlled by the *Pod user, we must securely isolate the *Pod device from the underlying machine. Previous operating system virtualization techniques either are not designed to provide secure isolation, and therefore do not protect the host machine from rogue processes running within the device's context, or require significant changes to the underlying operating system. For example, pods as introduced by Zap [100] provide a level of isolation and enable multiple pods to coexist on a single system, but they were not designed to be secure. Zap's virtualization operates by providing each environment with its own virtual private namespace. A pod contains its own host-independent view of operating system resources such as PID/GID, IPC, memory, file system and devices. The namespace is the only means for the processes to access the underlying operating system. Zap introduces this namespace to decouple processes from the host's operating system. But Zap assumes that the person using the pod already has privileged access to the machine, and therefore is not directly concerned with a user breaking out of the abstraction. Without protecting the host, no one would allow *Pod devices to use their systems. Therefore, we leverage operating system virtualization at both the kernel and file system levels to create the secure pod abstraction, enabling untrusted *Pod devices to be used securely.

Protecting the host from rogue processes requires a complete virtualization abstraction that totally confines the process and prevents it from breaking the abstraction and effecting change to the host machine. The secure pod abstraction achieves this in two ways. First, it prevents processes within it from accessing any file system outside of the *Pod device. Second, while it lets processes in the *Pod device context run with privilege, it prevents any privileged action that could break the abstraction.

Many previous operating system virtualization architectures relied on the chroot functionality to provide a private file system namespace in which processes run.
While chroot can give a set of processes a virtualized file system namespace, there are many ways to break out of the standard chrooted environment, especially if one allows the chroot system call to be used by the virtualized processes. To prevent this, the secure pod abstraction virtualizes the file system interface and implements a barrier, thereby enforcing the chrooted environment even while allowing the chroot system call. We can implement a barrier easily because file systems provide a ->permission method that determines if a process can access a file. For example, if a process tries to access a file a few directories below the current directory, the file system’s ->permission method is called on each directory as well as the file itself, in order. If any call determines that the process does not have permission on a directory, the chain of calls ends. Even if the ->permission method were to determine that the process has access to the file itself, it must have permission to traverse the directory hierarchy to reach the file. We implemented a barrier simply by stacking a small virtual file system on top of the staging directory that virtualized the underlying ->permission method to prevent the virtualized processes from accessing the parent directory of the Chapter 3. *Pod: Improving User Mobility 26 staging directory. This effectively confines *Pod processes to the *Pod’s file system by preventing a rogue process from ever walking past the *Pod’s file system root. The secure pod abstraction also takes advantage of the user identifier (UID) security model in traditional file systems to support multiple security domains on the same system running on the same operating system kernel. For example, since each secure pod has its own private file system, it has its own /etc/passwd file that determines its list of users and their corresponding UIDs. In traditional Unix file systems, the UID of a process determines what permissions it has when accessing a file. This means that since the *Pod’s file system is separate from the host file system, a *Pod process is effectively running in a separate security domain from another process with the same UID that is running directly on the host system. Although both processes have the same UID, the *Pod process is only allowed to access files in its own file system namespace. Similarly, this model allows multiple secure pods on a single system to contain independent processes running with the same UID. This UID model supports an easy-to-use migration model when a user may be using a *Pod device on a host in one administrative domain and then moves the *Pod device to another. Even if the user has computer accounts in both administrative domains, it is unlikely that the user will have the same UID in both domains if they are administratively separate. Nevertheless, the secure pod abstraction enables the user to run the same *Pod device with access to the same files in both domains. Suppose the user has UID 100 on a machine in administrative domain A and starts a pod connecting to a file server residing in domain A. Suppose that all virtualized processes are then running with UID 100. When the user moves to a machine in administrative domain B where they have UID 200, they can migrate their *Pod device to the new machine and continue running its processes. Those processes can continue to run as UID 100 and continue to access the same set of files on the *Pod Chapter 3. 
*Pod: Improving User Mobility 27 file system, even though the user’s real UID has changed. This works even if there is a regular user on the new machine with a UID of 100. Whereas this example considers the case of a *Pod device with all processes running with the same UID, it is easy to see that the secure pod abstraction supports running processes with many different UIDs. However, this only works for regular processes, however, because they do not have special privileges. But because the root UID 0 is privileged and treated specially by the operating system kernel, the secure pod virtualization abstraction treats UID 0 processes within a secure pod specially as well. We must do this to prevent processes with privilege from breaking the virtualization abstraction, accessing resources on the host, and harming it. The secure pod abstraction does not disallow UID 0 processes, as this would limit the range of application services that can be virtualized. Instead, it restricts such processes to ensure that they function correctly when virtualized. While a process is running in user space, its UID does not have any effect on process execution. Its UID only matters when it tries to access the underlying kernel via one of the kernel entry points, namely devices and system calls. Since the secure pod abstraction already provides a virtual file system that includes a virtual /dev with a limited set of secure devices, the device entry point is already secured. Furthermore, the secure pod abstraction disallows device nodes on the *Pod device’s file system. The only system calls of concern are those that could allow a root process to break the virtualization abstraction. Only a small number of system calls can be used for this purpose. These system calls are listed and described in further detail in Appendix A. Secure pod virtualization classifies these system calls into three classes. The first class of system calls are those affecting only the host system and serving no purpose within a virtualized context. Examples of these system calls include those that load and unload kernel modules (create module, delete module) or that reboot Chapter 3. *Pod: Improving User Mobility 28 the host system (reboot). Since they only affect the host, they would break the secure pod abstraction by allowing processes within it to make administrative changes to the host. System calls that are part of this class are therefore made inaccessible by default to virtualized processes. The second class of system calls are those forced to run unprivileged. Just as NFS, by default, squashes root on a client machine to act as user nobody, secure pod virtualization forces privileged processes to act as the nobody user when they execute some system calls. Examples of these system calls include those that set resource limits and ioctl system calls. Because some system calls, such as setrlimit and nice, can allow a privileged process to increase its resource limits beyond predefined limits imposed on virtualized processes, privileged virtualized processes are by default treated as unprivileged when executing these system calls. Similarly, the ioctl system call is a multiplexer that effectively allows any driver on the host to install its own set of system calls. It is impossible to audit the large set of system calls, given that a *Pod device may be used on a wide range of machine configurations, so we conservatively treat access to this system call as unprivileged by default. 
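Stepping back briefly to the file system barrier described earlier in this section, its effect can be pictured with a small stand-alone check. The sketch below approximates, in user-space C, what the stacked ->permission method enforces: any path that resolves to a location outside the pod's staging directory is refused, so a process cannot walk past the pod's file system root even using symlinks or ".." tricks. This is only an analogy (the real barrier is enforced inside the kernel on every directory traversal), and the names POD_ROOT and pod_path_allowed, as well as the example path, are invented for the illustration.

    /*
     * User-space analogy for the barrier: resolve a path and refuse anything
     * that escapes the pod's file system root. The in-kernel barrier instead
     * stacks a thin file system over the staging directory and denies the
     * ->permission check on its parent; this sketch only mirrors the effect.
     */
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define POD_ROOT "/autopod/exim/root"    /* example pod staging directory */

    /* Return 1 if 'path' resolves to a location inside the pod root. */
    static int pod_path_allowed(const char *path)
    {
        char resolved[PATH_MAX];

        /* realpath() follows symlinks and "..", so escape attempts show up
         * in the resolved path; unresolvable paths are simply denied. */
        if (!realpath(path, resolved))
            return 0;

        size_t rootlen = strlen(POD_ROOT);
        if (strncmp(resolved, POD_ROOT, rootlen) != 0)
            return 0;
        /* Reject sibling directories such as /autopod/exim/root-evil. */
        return resolved[rootlen] == '\0' || resolved[rootlen] == '/';
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            printf("%-40s %s\n", argv[i],
                   pod_path_allowed(argv[i]) ? "inside pod" : "denied");
        return 0;
    }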
The final class of system calls are those that are required for regular applications to run, but that have options that will give the processes access to the underlying host resources, breaking the isolation provided by the secure pod abstraction. Since these system calls are required by applications, the secure pod virtualization checks all their options to ensure that they are limited to resources to which the *Pod device has access, making sure they do not break the secure pod abstraction. For example, the mknod system call can be used by privileged processes to make named pipes or files in certain application services. It is therefore desirable to make it available to virtualized processes. But it can also be used to create device nodes that provide access to the underlying host resources. The secure pod’s kernel virtualization mechanism checks Chapter 3. *Pod: Improving User Mobility 29 the options of the system call and only allows it to continue if it is not trying to create a device. 3.2 Using a *Pod Device A user starts a *Pod device simply by plugging it in to a computer. The computer detects the device and automatically tries to restart the *Pod session. The user is asked for a password. Authentication can also be done without a password by using built-in fingerprint readers available on some USB drives [11]. Once a user is authorized, the *Pod device mounts its file system, restarts its desktop computing session, and attaches a *Pod viewer to the session, making the associated set of applications available and visible to the user. Applications running in a *Pod session appear to the underlying operating system just like other applications that may be running on the host machine, and they use the host’s network interface in the same manner. Once the *Pod is started, the user can use the applications available in the computing environment. When the user wants to leave the computer, they simply close the *Pod viewer. The *Pod session is quickly checkpointed to the *Pod storage device, which can then be unplugged and carried around by the user. When the user is ready to use another computer, they simply plug in the *Pod device and the session restarts exactly where it was suspended. With a *Pod-based device, the user does not need to manually launch applications and reload documents. The *Pod’s integrated checkpoint/restart functionality maintains a user’s computing session persistently as a user moves from one computer to another, even including ephemeral states such as copy/paste buffers. If the host machine crashes, it takes down the current *Pod session with it. But since *Pod devices do not provide their own operating system, Chapter 3. *Pod: Improving User Mobility 30 one can simply plug it into a new host machine and start a fresh *Pod session. The only data lost is that not committed to disk when the host machine crashes. In addition, the *Pod device’s file system is automatically backed up when connected to the user’s primary computer. This enables quick recovery if the device is lost. The user can replicate the file system on a new device and continue working. 3.3 Experimental Results We have implemented four *Pod devices: WebPod [113], DeskPod [114], MediaPod [109] and GamePod [110]. Each *Pod device contains three components: a simple viewer application for accessing the *Pod session, an unmodified XFree86 4.3 display server with THINC’s virtual display device driver, and a loadable kernel module in Linux that requires no changes to the Linux kernel. 
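Returning to the system call treatment described in Section 3.1.1, the sketch below shows one way the three classes could be expressed: a classification table that denies host-only calls, forces calls such as setrlimit, nice and ioctl to run unprivileged, and admits calls such as mknod only after their arguments are vetted. It is a stand-alone illustration, not the prototype's kernel-module interposition code; the enum values, function names and the particular table entries are choices made for this sketch.

    /*
     * Illustration of the three system call classes: deny host-only calls,
     * "squash" privilege for calls such as setrlimit/nice/ioctl, and vet the
     * arguments of calls such as mknod so they cannot create device nodes.
     * Stand-alone sketch; names and the classification table are hypothetical.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    enum pod_action {
        POD_DENY,          /* class 1: affects only the host, never allowed  */
        POD_SQUASH_ROOT,   /* class 2: run as an unprivileged user           */
        POD_CHECK_ARGS,    /* class 3: allowed only after argument checking  */
        POD_ALLOW
    };

    static enum pod_action classify(const char *syscall_name)
    {
        if (!strcmp(syscall_name, "reboot") ||
            !strcmp(syscall_name, "create_module") ||
            !strcmp(syscall_name, "delete_module"))
            return POD_DENY;
        if (!strcmp(syscall_name, "setrlimit") ||
            !strcmp(syscall_name, "nice") ||
            !strcmp(syscall_name, "ioctl"))
            return POD_SQUASH_ROOT;
        if (!strcmp(syscall_name, "mknod"))
            return POD_CHECK_ARGS;
        return POD_ALLOW;
    }

    /* mknod argument check: named pipes and regular files are fine, but
     * block or character device nodes would pierce the pod and are refused. */
    static int mknod_args_ok(mode_t mode)
    {
        return S_ISFIFO(mode) || S_ISREG(mode);
    }

    int main(void)
    {
        printf("reboot    -> class %d\n", classify("reboot"));
        printf("setrlimit -> class %d\n", classify("setrlimit"));
        printf("mknod     -> class %d\n", classify("mknod"));
        printf("mknod of a FIFO allowed?        %d\n", mknod_args_ok(S_IFIFO | 0600));
        printf("mknod of a char device allowed? %d\n", mknod_args_ok(S_IFCHR | 0600));
        return 0;
    }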
The kernel module provides the secure pod's operating system virtualization layer and Zap's process migration mechanism. We present experimental results using our Linux prototype to quantify the overhead of using the *Pod device on various applications. Experiments were conducted on three IBM PC machines, each with a 933 MHz Intel Pentium-III CPU and 512 MB RAM. The machines each had a 100 Mbps NIC and were connected to one another via 100 Mbps Ethernet and a 3Com Superstack II 3900 switch. Two machines were used as hosts for running the *Pod device and the third was used as a web server for measuring web benchmark performance. To demonstrate *Pod's ability to operate across different operating system distributions, each host machine was configured with a different Linux distribution: one ran Debian 3.0 (“Woody”) and the other Debian 3.1 (“Sarge”), both with a Linux 2.4.18 kernel. We used a 40 GB Apple iPod as the *Pod portable storage device, although a much smaller USB memory drive would have sufficed. Both PCs used FireWire to connect to the iPod.

We built an unoptimized *Pod file system by bootstrapping a Debian GNU/Linux installation onto the iPod and installing the appropriate set of applications. In all cases, we included a simple KDE 2.2.2 environment. WebPod additionally included the Konqueror 2.2.2 web browser. GamePod included Quake 2, Tetris and Solitaire. DeskPod added on top of WebPod the entire KDE Office Suite, with all the desktop applications a user needs. Finally, MediaPod added on top of DeskPod multiple media-related applications, including video, DVD and music players, with their related codecs. We removed the extra packages needed to boot a full Linux system, as *Pod is just a lightweight application environment, not a full operating system.

          WebPod   GamePod   DeskPod   MediaPod
    Size  163 MB   283 MB    418 MB    633 MB

    Table 3.1 – Per-Device *Pod File System Sizes

As can be seen in Table 3.1, the various *Pod devices we built all have minimal storage requirements, enabling them to be stored on many portable devices with ease. In addition, our unoptimized *Pod file systems could be even smaller if the file system were built from scratch instead of by installing programs and libraries as needed.

To measure the cost of *Pod's virtualization, we took a range of benchmarks that represent various operations that occur in a normal application environment and measured their performance on both our Linux *Pod prototype and a vanilla Linux system. We used a set of micro-benchmarks that represent operations executed by real applications as well as a real web browsing application benchmark. Table 3.2 shows the 6 benchmarks we used along with their performance on a vanilla Linux system in which all benchmarks were run from a local disk.

    Name       Description                                                      Linux
    getpid     average getpid runtime                                           350 ns
    ioctl      average runtime for the FIONREAD ioctl                           427 ns
    semaphore  IPC semaphore variable is created and removed                    1370 ns
    fork-exit  process forks and waits for child, which calls exit immediately  44.7 µs
    fork-sh    process forks and waits for child to run /bin/sh to run a
               program that prints “hello world” then exits                     3.89 ms
    iBench     measures the average time it takes to load a set of web pages    826 ms

    Table 3.2 – Benchmark Descriptions
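To give a concrete sense of how a system call micro-benchmark from Table 3.2 can be timed, the sketch below measures getpid using the processor's timestamp counter. The iteration count, the use of the __rdtsc() compiler intrinsic, and the conversion from cycles to nanoseconds via the 933 MHz clock rate are assumptions of this sketch, not details of the original measurement harness.

    /*
     * Sketch of a TSC-based system call micro-benchmark: time a tight getpid
     * loop in cycles and convert to nanoseconds. The iteration count and use
     * of the __rdtsc() intrinsic are choices made for this sketch.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <x86intrin.h>          /* __rdtsc() with GCC/Clang on x86 */

    #define ITERATIONS 100000
    #define CPU_MHZ    933.0        /* Pentium-III clock used in these experiments */

    int main(void)
    {
        volatile pid_t pid = 0;
        unsigned long long start, end;

        start = __rdtsc();
        for (int i = 0; i < ITERATIONS; i++)
            pid = getpid();         /* the system call being measured */
        end = __rdtsc();

        /* cycles per call divided by cycles per nanosecond (= GHz) gives ns */
        double ns_per_call = (double)(end - start) / ITERATIONS / (CPU_MHZ / 1000.0);
        printf("getpid: %.0f ns per call (last pid %d)\n", ns_per_call, (int)pid);
        return 0;
    }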
These benchmarks were then run for comparison purposes in the *Pod portable storage environment. To obtain accurate, repeatable results, we rebooted the system between measurements. Additionally, the system call micro-benchmarks directly used the TSC register available on Pentium CPUs to record timestamps at the significant measurement events; each timestamp's average cost was 58 ns. The files for the benchmarks were stored on the *Pod's file system. All of these benchmarks were performed in a *Pod environment running on the PC machine that ran Debian Unstable with a Linux 2.4.18 kernel.

    [Figure 3.1 – *Pod Virtualization Overhead. Performance of each benchmark (getpid, ioctl, semaphore, fork-exit, fork-sh, iBench) under *Pod, normalized to plain Linux at 1.0.]

Figure 3.1 shows the results of running our benchmarks under both configurations, with the vanilla Linux configuration normalized to 1. A smaller number is better for all benchmark results. Figure 3.1 shows that *Pod virtualization overhead is small. *Pod incurs less than 10% overhead for most of the micro-benchmarks and less than 4% overhead for the iBench application workload. The overhead for the simple system call getpid benchmark is only 7% compared to vanilla Linux, reflecting the fact that *Pod virtualization for these kinds of system calls only requires an extra procedure call and a hash table lookup. The most expensive benchmark for *Pod is the semaphore benchmark (semget+semctl), which took 51% longer than vanilla Linux. The cost reflects the fact that our untuned *Pod prototype needs to allocate memory and do a number of namespace translations. Kernel semaphores are widely used by web browsers such as Mozilla and Konqueror to perform synchronization. The ioctl benchmark also shows relatively high overhead because *Pod performs 12 separate assignments to protect the call against malicious processes, which is substantial compared to the simple FIONREAD ioctl, which performs only a single dereference; in absolute terms, however, this adds only about 200 ns of overhead to any ioctl. There is minimal overhead for functions such as fork and the fork/exec combination. This is indicative of what happens when the web browser loads a plugin such as Adobe Acrobat, where the web browser runs the acroread program in the background.

Figure 3.1 shows that *Pod has low virtualization overhead for real applications as well as micro-benchmarks. This is illustrated by the performance on the iBench benchmark, which is a modified version of the Web Text Page Load test from the Ziff-Davis iBench 1.5 benchmark suite. It consists of a JavaScript-controlled load of a set of web pages from the web benchmark server. iBench also uses JavaScript to measure how long it takes to download and process each web page, then determines the average download time per page. The pages contain both text and bitmap graphics, with pages varying in the proportions of text and graphics. The graphics are embedded images in GIF and JPEG formats. Our results show that running the iBench benchmark in the *Pod environment incurs virtually no performance overhead versus running in vanilla Linux from local SCSI storage.

To measure the cost of checkpointing and restarting *Pod sessions, as well as demonstrate *Pod's ability to improve the way a user works with various applications, we migrated multiple *Pod sessions containing different sets of applications.
For WebPod, we migrated multiple sessions containing different numbers of open browser windows between the two machines described above. For DeskPod, we migrated a session containing the KWrite word processor, the KSpread spreadsheet and the Konqueror web browser, each displaying a document, in addition to a Konsole terminal application; this is indicative of a regular desktop environment. For MediaPod, we migrated multiple sessions containing different sets of running desktop and multimedia applications: first, a MediaPod using the Totem media player playing an XviD-encoded version of a DVD; second, a MediaPod using Ogle playing a straight DVD image copied to the MediaPod; third, a MediaPod playing an mp3 file using the mpg123 program.

    [Figure 3.2 – *Pod Checkpoint/Restart vs. Normal Startup Latency. Checkpoint, restart and plain startup times in seconds (log scale) for each session: 1 Browser, 10 Browsers, Desktop, Totem, Ogle, mpg123, Solitaire, Tetris and Quake, grouped by WebPod, DeskPod, MediaPod and GamePod.]

Figure 3.2 shows how long it takes to checkpoint to disk and warm cache restart the multiple *Pod sessions described above. We compared this to how long it would take to warm cache startup each session independently. Figure 3.2 shows that, in general, it is significantly faster to checkpoint and restart *Pod sessions than it is to start the same kind of session from scratch. Checkpointing and restarting a *Pod, even with many browser windows opened, takes under a second. A *Pod user can disconnect, plug in to another machine, and start using their session again very quickly. Many tasks have a large startup time, such as Ogle, which iterates through all the files on the DVD image to determine if they have to be decrypted and calculates the decryption key. Furthermore, these experiments were run across two different machines with two different operating system environments, demonstrating that *Pod can indeed work across different software environments.

In contrast, Figure 3.2 shows that starting the applications the traditional way is much slower in all cases. For instance, starting a browsing session takes 12 seconds when opening the browser windows with actual web content. Even starting a web browsing session by opening a single browser window takes more than a second. In the mpg123 case, the mpg123 application appears to start faster than *Pod can restart it, but this is not a direct comparison: for plain startup, all we are doing is starting the small 136 KB mpg123 application, while for *Pod restart, we are restarting the entire KDE desktop environment as well. It should be noted that *Pod's approach to restarting applications is fundamentally different than plain restarting, as *Pod returns the application sessions to where they were executing when they were suspended. For instance, a restarted MediaPod session will continue playing the file from where it was, while a WebPod session will show the web browser's content, even if the content on the web has changed in the meantime. Similarly, restarting a *Pod device requires restarting all applications associated with it, including the desktop environment, as opposed to starting a plain application that uses the desktop environment that is already running.

Table 3.3 shows the amount of storage needed to store the checkpointed sessions using *Pod for each of the *Pod devices and sessions described. The results reported show checkpointed image sizes without applying any compression techniques to reduce the image size.

          WebPod          DeskPod   MediaPod                 GamePod
          1 Web   10 Web            Totem   Ogle    mpg123   Solitaire   Tetris   Quake
    Size  25 MB   46 MB   50 MB     44 MB   27 MB   17 MB    44 MB       22 MB    50 MB

    Table 3.3 – *Pod Checkpoint Sizes
These results show that the checkpointed state that needs to be saved is very modest and easy to store on any portable storage device. Given the modest size of the checkpointed images, there is no need for any additional compression, which would reduce the minimal storage demands, but add additional latency due to the need to compress and decompress the checkpointed images. The checkpointed image size in all cases was 50 MB or less. 3.4 Related Work *Pod builds upon our previous work on MobiDesk [26], which provides a hosted desktop infrastructure that improves management by enabling the desktop sessions to be migrated between the back-end infrastructure machines. *Pod differs from MobiDesk in two fundamental ways. First, it builds upon MobiDesk by coupling its compute session migration with portable storage to improve users’ mobility. Second, MobiDesk is limited to a single administrative domain. Unlike *Pod devices, which can be moved Chapter 3. *Pod: Improving User Mobility 37 between machines managed by different users and organizations, MobiDesk sessions can only exist within a single organization and therefore do not require the secure operating system virtualization abstraction. Given the ubiquity of web browsers on modern computers, many traditional applications are becoming web-enabled, allowing the mobile user to use them from any computer. Common applications such as email [2, 5], instant messaging [20], and even word processing and spreadsheet applications [3] have been ported to a web services environment that is usable within a simple web browser. The advantage of this approach is that users effectively store their data on centrally managed servers accessible from any networked computer. But even the web user relies on various applications, such as Adobe Acrobat Reader, to be available on whatever computer they are using at the moment. If the application is already installed on the host, the web browser can use it, but otherwise, the user is unable to complete the task at hand. Some web-based applications have been created to fill these gaps, such as one that converts PDF files to simple image files viewable from any web browser. These approaches, however, are application-specific and often quite limited. For instance, converting PDF files to simple image files cuts out useful features of the native application, such as the ability to search the PDF. Similarly, items like cookies and bookmarks allow a user to work more efficiently, but do not travel with the user as they move between web browsers on different machines. Another solution that solves many of the above problems is the use of thin-client computing [14, 27, 51]. The thin-client approach provides several significant advantages over traditional desktop computing. Clients can be essentially stateless appliances that do not need to be backed up or restored, require almost no maintenance or upgrades, and do not store any sensitive data that can be lost or stolen. Server resources can be physically secured in protected data centers and centrally adminis- Chapter 3. *Pod: Improving User Mobility 38 tered, with all the attendant benefits of easier maintenance and cheaper upgrades. Computing resources can be consolidated and shared across many users, resulting in more effective utilization of computing hardware. 
Moreover, the ability of thin clients to decouple display from application execution over a network offers a myriad of other benefits, including graphical remote access to a persistent session from anywhere, screen sharing for remote collaboration, and instant technical support. A number of solutions resembling the thin-client approach have sprung up in the past. The model has come and gone many times, however, whether in mainframe dumb terminals, X terminals, or network computers, without being able to displace the desktop computer. No matter how fast the network connection is, the connection between the computer and the local video device will be significantly faster. For example, one would need gigabit ethernet to transfer a decoded DVD across the network, while even 10-year-old PCs have enough video bandwidth to do this. Although gigabit ethernet is becoming more common today, we are transitioning to high-definition video streams which require many times more bandwidth. Similarly, many applications, especially 3D-oriented ones, need to transfer large amounts of data quickly, and have been shown to use as much bandwidth as possible, as they are what is pushing the state of the art in hardware graphic devices. The emergence of cheap, portable storage devices has led to the development of web browsers for USB drives, including Stealth Surfer [10] and Portable Firefox [8]. These approaches only provide the ability to run a web browser on a USB drive. Unlike *Pod, they do not provide a complete application environment. The various programs and plugins that make the user’s experience more comfortable do not work within this environment. The U3 platform [12] has attempted to provide a standard way to enable applications to store data and launch applications. But it has not gained any traction in the marketplace and, unlike *Pod, does not address mobile users’ need Chapter 3. *Pod: Improving User Mobility 39 for persistent application sessions that can be easily moved between locations. Systems like SoulPad [37] and the Collective [41] provide a solution similar to *Pod, but are based on using a bootable Linux distribution like Knoppix and VMware [142] on a USB drive. For these systems, Knoppix provides a Linux operating system that can boot from a USB drive for certain hardware platforms. VMware provides a virtual machine monitor (VMM) that enables an entire operating system environment and its applications to be suspended and resumed from disk. They are designed to take over the host computer they are plugged into by booting their own operating system. They then launch a VMware VM that runs the migratable operating system environment. Unlike *Pod, they do not rely on any software installed on the host. However, they require minutes to start up given the need to boot and configure an entire operating system for the specific host being used. *Pod does not need to provide an entire operating system instance for the virtual machine to run, and so is much more lightweight. *Pod requires less storage, so it can operate on smaller USB drives, and does not require rebooting the host into another operating system, so it starts up much faster. However, unlike these systems, *Pod is limited to the same operating system interface as the host machine, and requires a secure operating system virtualization layer to be written for every operating system it is to be used with. 
Moka5 [94] attempts to optimize the management and distribution of these portable hardware virtual machine-based devices by storing the virtual machine on the network and only requiring the user to carry a small cache storage device. This cache storage device provides a base host operating system and the ability to page in the necessary parts of the virtual machine on demand. Whereas today’s storage devices can easily hold the entire virtual machine, this cache architecture improves management. It allows the virtual machines to be upgraded on the server by a central administrator, Chapter 3. *Pod: Improving User Mobility 40 with updates pulled into the cache when the machine is rebooted. In general, providing virtualization and checkpoint/restart capabilities using a VMM such as VMware represents an interesting alternative to the *Pod operating system virtualization approach. VMMs virtualize the underlying machine hardware while *Pod virtualizes the operating system. VMMs can checkpoint and restart an entire operating system environment. However, unlike *Pod, VMMs cannot checkpoint and restart applications without also checkpointing and restarting the operating system. *Pod virtualization operates at a finer granularity than virtual machine approaches by virtualizing individual sessions instead of complete operating system environments. Using VMMs can be more space- and time-intensive because the operating system must be included on the portable storage device. Chapter 4 AutoPod: Reducing Downtime for System Maintenance A key problem many organizations face is keeping their computer services available while the underlying machines are maintained. These services run on increasingly networked computers, which are frequent targets of attacks that attempt to exploit vulnerable software they could be running. To prevent these attacks from succeeding, software vendors frequently release patches to address security and maintenance issues. But for these patches to be effective, they must be applied to the machines. System maintenance, however, commonly results in a system service being unavailable. For example, patching an operating system may mean that the whole system is down for a length of time. If system administrators fix an operating system security problem immediately, they risks upsetting their users because of loss of data. If the underlying hardware has to be replaced, the machine will have to be shut down. The system administrators must schedule downtime in advance and in cooperation with users, leaving the computer vulnerable until repaired. If the operating system is patched successfully, downtime may be limited to just a few minutes during Chapter 4. AutoPod: Reducing Downtime for System Maintenance 42 the reboot. Even then, users incur additional inconvenience and delays in starting applications again and attempting to restore their sessions. If the patch is not successful, downtime can extend for many hours while the problem is diagnosed and solved. Downtime due to security and maintenance problems is costly as well as inconvenient. Therefore, it is not uncommon for systems to continue running unpatched software long after a security exploit is well-known [123]. To address these problems, we have designed and built AutoPod, a system that provides an easy-to-use autonomic infrastructure [77] for operating system self-maintenance. 
AutoPod is unique because it enables unscheduled operating system updates of commodity operating systems while preserving application service availability during system maintenance. AutoPod functions without modifying, recompiling, or relinking applications or operating system kernels. We have done this by combining three key mechanisms: a lightweight operating system virtualization isolation abstraction that can be used at the level of individual applications, a checkpoint/restart mechanism that operates across operating system versions with different security and maintenance patches, and an autonomic system status service that monitors the system for faults and security updates.

AutoPod combines *Pod's secure pod abstraction with a novel checkpoint/restart mechanism that uniquely decouples processes from the underlying system and maintains process state semantics, allowing processes to migrate across different machines with different operating system versions. The checkpoint/restart mechanism introduces a platform-independent intermediate format for saving the states associated with processes and AutoPod virtualization. AutoPod combines this format with higher-level functions for saving and restoring process states to yield a degree of portability impossible with previous approaches. This checkpoint/restart mechanism relies on the same kind of operating system semantics that allow applications to function correctly across operating system versions with different security and maintenance patches.

AutoPod combines these mechanisms with an autonomous system status service. The service monitors the system for faults and security updates. When the service detects new security updates, it downloads and installs them automatically. If the update requires a reboot, the service uses AutoPod's checkpoint/restart capability to save the AutoPod's state, reboot the machine into the newly repaired environment, and restart the processes within the AutoPod without data loss. This permits fast recovery from downtime even when other machines are not available to run application services. Alternatively, if another machine is available, the AutoPod can be migrated to the new machine while the original machine is maintained and rebooted, further reducing application service downtime. This allows security patches to be applied to operating systems in a timely manner with minimal impact on application service availability. Once the original machine is updated, applications can continue to execute even though the underlying operating system has changed. Similarly, if the service detects an imminent system fault, AutoPod can checkpoint the processes, migrate, and restart them on a new machine before the fault causes their execution to fail.

4.1 AutoPod Architecture

The AutoPod architecture is based on *Pod's secure pod abstraction. As shown in Figure 4.1, AutoPod permits server consolidation by allowing multiple pods to run on a single machine while enabling automatic machine status monitoring.

    [Figure 4.1 – AutoPod Model. Multiple pods (Pod A, Pod B) and the AutoPod system monitor run on top of the AutoPod virtualization layer, which sits above the host operating system and host hardware.]

As each pod provides a complete secure virtual machine abstraction, it is able to run any server application that would run on a regular machine. By consolidating multiple machines
into distinct pods running on a single server, the administrator has fewer physical hardware and operating system instances to manage. Similarly, when kernel security holes are discovered, server consolidation minimizes the number of machines to be upgraded and rebooted. The AutoPod system monitor further improves manageability by constantly monitoring the host system for stability and security problems.

By leveraging the secure pod abstraction, AutoPod is able to securely isolate multiple independent services running on a single machine. Operating system virtualization restricts what operating system resources are accessible to processes within it simply by not providing identifiers to certain resources within its namespace. An AutoPod can then be constructed to provide access only to resources needed for its service. An administrator configures the AutoPod in the same way a regular machine is configured and installs applications within it. The secure pod abstraction enforces secure isolation to prevent exploited services from attacking the host or other services on it. Similarly, secure isolation allows running multiple services from different organizations, with different sets of users and administrators on a single host, while retaining the semantics of multiple distinct and individually managed machines.

Multiple services that previously ran on multiple machines can now run on a single machine. For example, a web server pod is easily configured to contain only the files the web server needs to run and the content it is to serve. The web server pod could have its own IP address, decoupling its network presence from that of the underlying system. Using a firewall, the pod's network access is limited to client-initiated connections. Connections to the pod's IP address are limited to the ports served by the application running within this pod. If multiple isolated web servers are required, multiple pods can be set up on a single machine. If one web server application is compromised, its pod limits further harm to the system, because the only resources the compromised pod can access are those explicitly needed by its service. Because this web server pod does not need to initiate connections to other hosts, it is easy to firewall it to prevent it from directly initiating connections to other systems. This limits an attacker's ability to use the exploited service as a launching point for other attacks. Furthermore, there is no need to disable other network services commonly enabled by the operating system to guard against the compromised pod because those services, and the operating system itself, reside outside the pod's context.

4.2 Migration Across Different Kernels

AutoPod complements the secure pod virtualization abstraction with a cross-kernel checkpoint/restart system that improves the mobility of services within a data center. Checkpoint/restart provides the glue that permits a pod to checkpoint services, migrate the state to a new machine, and restart them across other computers with different hardware and operating system kernels. AutoPod's migration is limited to machines with a common CPU architecture, and that run “compatible” operating systems.
Compatibility is determined by the extent to which they differ in their API and internal semantics. Minor versions are normally limited to maintenance and security patches, without affecting the kernel’s API. Major versions carry significant changes, like modifying the application’s execution semantics or introducing new functionality, that may break application compatibility. Nevertheless, they are usually backward compatible. For instance, the Linux kernel has two major versions, 2.4 and 2.6, with over 30 minor versions each. Linux 2.6 significantly differs from 2.4 in how threads behave, and also introduces various new system calls. This implies that migration across minor versions is generally not restricted, but migration between major versions is only feasible from older to newer. To support migration across different kernels, AutoPod’s checkpoint/restart mechanism employs three key design principles: storing operating system state in adequate abstract representation, converting between the abstract representation and operating system-specific state using specialized filters, and using well-established native kernel interfaces to access and alter the state. AutoPod’s checkpoint/restart mechanism relies on an intermediate abstract format to represent the state to be saved. While the low-level details maintained by the operating system may change radically between different kernels, the high-level properties are unlikely to change since they reflect the actual semantics upon which the application relies. AutoPod describes the state of a process in terms of this higher-level semantic information rather than kernel-specific data. To illustrate this, consider the data that describe inter-process relationships, e.g., parent, child, siblings Chapter 4. AutoPod: Reducing Downtime for System Maintenance 47 and threads. The operating system normally optimizes for speed by keeping multiple data structures to reflect these relationships. But this format has limited portability across different kernels; in Linux, the exact technique did indeed change between 2.4 and 2.6. Instead, AutoPod uses a tree structure to capture a high-level representation of the relationships, mirroring its semantics. The same holds for other resources, e.g., communication sockets, pipes, open files and system timers. AutoPod extracts the relevant state the way it is encapsulated in the operating system’s API, rather than in the details of its implementation. Doing so maximizes portability across kernel versions by adopting properties that are considered highly stable. To accommodate differences in semantics that inevitably occur between kernel versions, AutoPod uses specialized conversion filters. The checkpointed state data is saved and restored as a stream. The conversion filters manipulate the contents of this stream. Although typically they are designed to translate between different representations, they can be used to perform other operations such as compression and encryption. Their main advantages are extreme flexibility and being executed like regular helper applications. Building on the example above, because the thread model changes between Linux 2.4 and 2.6, a filter can easily be designed to make the former abstract data adhere to the new semantics. Additional filters can be built if semantic changes occur in the future. This is a very robust and powerful solution. 
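As a rough illustration of the filter idea, the sketch below treats the checkpoint image as a stream of tagged, length-prefixed records and copies it from input to output, with a hook where a record whose semantics changed between kernel versions (such as the process-tree/thread record) would be rewritten. The record layout, tag values and function names are hypothetical, not AutoPod's actual on-disk format, but they show why a filter can be run like an ordinary helper program in a pipeline.

    /*
     * Skeleton of a checkpoint conversion filter: read tagged, length-prefixed
     * records from stdin, rewrite the ones whose semantics changed between
     * kernel versions, and write the stream to stdout. The format and tags
     * are hypothetical, not AutoPod's actual checkpoint representation.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct rec_hdr {
        uint32_t tag;                /* which kind of state the record holds */
        uint32_t len;                /* payload length in bytes              */
    };

    #define TAG_PROC_TREE  1         /* parent/child/thread relationships    */
    #define TAG_OPEN_FILES 2
    #define TAG_SOCKETS    3

    /* Hypothetical rewrite of a 2.4-style thread record into 2.6 semantics. */
    static void convert_proc_tree(unsigned char *buf, uint32_t len)
    {
        (void)buf;
        (void)len;                   /* a real filter would edit fields here */
    }

    int main(void)
    {
        struct rec_hdr h;

        while (fread(&h, sizeof(h), 1, stdin) == 1) {
            unsigned char *buf = malloc(h.len ? h.len : 1);
            if (!buf || fread(buf, 1, h.len, stdin) != h.len) {
                fprintf(stderr, "truncated checkpoint stream\n");
                return 1;
            }
            if (h.tag == TAG_PROC_TREE)
                convert_proc_tree(buf, h.len);
            fwrite(&h, sizeof(h), 1, stdout);
            fwrite(buf, 1, h.len, stdout);
            free(buf);
        }
        return 0;
    }

Because such a filter only transforms one stream into another, it can be composed with other helpers, such as compression or encryption, in the same checkpoint or restart pipeline.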
AutoPod leverages high-level native kernel services in order to transform the intermediate representation of the checkpointed image into the complete internal state required by the target kernel during restart. Continuing with the previous example, AutoPod restores the structure of the process tree by exploiting the native fork system call. According to the abstract process tree data, a sequence of fork calls is issued to replicate the original relationships. This avoids dealing with any internal kernel details. Moreover, high-level primitives of this sort remain virtually unchanged Chapter 4. AutoPod: Reducing Downtime for System Maintenance 48 across minor or major kernel changes. Finally, these services are available for use by loadable kernel modules, enabling AutoPod to perform cross-kernel migration without requiring modifications to the kernel. To eliminate possible dependencies on low-level kernel details, AutoPod’s checkpoint/restart mechanism requires processes to be suspended before being checkpointed. Suspending processes creates a quiescent state necessary to guarantee the correctness of the checkpointed image, and substantially reduces the amount of information that needs to be saved by avoiding transient data. For example, consider a checkpoint started while one of the processes is executing the exit system call. It would take tremendous effort and detail to ensure a proper and consistent capture of such a transient state. Instead, by first suspending all processes, such ongoing activities are either completed or interrupted. AutoPod uses this property to guarantee a consistent and static state during the checkpoint. Finally, we must ensure that changes in system call interfaces are properly handled. AutoPod has a virtualization layer that employs system call interposition to maintain namespace consistency. It follows that a change in the semantics for any system call intercepted could raise an issue in migrating across such differences. Fortunately, such changes are rare, and when they occur, they are hidden by standard libraries from the application level lest they break the applications. Consequently, AutoPod is protected just as legacy applications are protected. On the other hand, the addition of new system calls to the kernel requires that the encapsulation be extended to support them. Moreover, it restricts migration back to older versions. For instance, an application that invokes the new waitid system call in Linux 2.6 cannot be migrated back to 2.4 unless an emulation layer exists there. AutoPod uses two techniques to save and restore device-specific states, depending on the device class. Some devices provide standard interfaces for applications to read Chapter 4. AutoPod: Reducing Downtime for System Maintenance 49 and set their state. Common sound cards and the Intel MMX processor extensions are two notable examples. With these, it is possible to easily inquire the device with regard to state prior to checkpoint, and reestablish it during restart. However, many device drivers maintain internal state that is practically inaccessible from the outside. AutoPod ensures that processes within its session only have access to such devices through the virtual device drivers provided by the AutoPod device. This makes it simple to checkpoint the device-specific data associated with the processes. 
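The use of the native fork interface to rebuild the process tree, described earlier in this section, can be sketched as follows. The tree encoding and function names are invented for the example, and the children here only print a message and recurse, where the real restart logic would go on to restore each process's saved state.

    /*
     * Minimal sketch of recreating a saved process tree with fork(): walk the
     * abstract tree and issue a fork for each child, so the parent/child
     * relationships are rebuilt through the kernel's own interface. The tree
     * encoding here is hypothetical.
     */
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct saved_proc {
        int vpid;                    /* checkpointed (virtual) PID          */
        int parent_vpid;             /* -1 marks the root of the pod's tree */
    };

    /* Example tree: root 1 has children 2 and 3; process 3 has child 4. */
    static const struct saved_proc tree[] = {
        { 1, -1 }, { 2, 1 }, { 3, 1 }, { 4, 3 },
    };
    #define NPROCS (sizeof(tree) / sizeof(tree[0]))

    static void rebuild(int parent_vpid)
    {
        for (size_t i = 0; i < NPROCS; i++) {
            if (tree[i].parent_vpid != parent_vpid)
                continue;
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return;
            }
            if (pid == 0) {
                /* Child: a real restart would now restore registers, memory
                 * and file state for this process; the sketch just recurses. */
                printf("recreated vpid %d as host pid %d\n",
                       tree[i].vpid, (int)getpid());
                rebuild(tree[i].vpid);
                _exit(0);
            }
            waitpid(pid, NULL, 0);   /* keep the demo's output ordered */
        }
    }

    int main(void)
    {
        rebuild(-1);                 /* fork the root, then its descendants */
        return 0;
    }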
For instance, the AutoPod display system is built using its own virtual display device driver which is not tied to any specific hardware device, and keeps its entire state in regular memory. As a result, its state can be readily checkpointed as a simple matter of saving that process similarly to others. After the process is restarted, the AutoPod viewer on the host reconnects to the virtual display driver to display the complete session. 4.3 Autonomic System Status Service AutoPod provides a generic autonomic framework for managing system state. The framework can monitor multiple sources for information and use this information to make autonomic decisions about when to checkpoint pods, migrate them to other machines, and restart them. Although there are many items that can be monitored, our service monitors two in particular. First, it monitors the vendor’s software security update repository to ensure that the system stays up to date with the latest security patches. Second, it monitors the underlying hardware of the system to ensure that an imminent fault is detected before the fault occurs and corrupts application state. By monitoring these two sets of information, the autonomic system status service is able to reboot or shut down the computer while checkpointing or migrating the processes. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 50 This helps to ensure that data is not lost or corrupted because of a forced reboot or hardware fault propagating into the running processes. Many operating system vendors enable users to automatically check for and install system updates. Example of these include Microsoft’s Windows Update service and Debian’s security repositories. These updates are guaranteed genuine through cryptographic signed hashes that verify that the contents come from the vendors. But some of these updates require reboots. In the case of Debian GNU/Linux, this is limited to kernel upgrades. We provide a simple service that monitors security update repositories. The autonomic service downloads all security updates and uses AutoPod’s checkpoint/restart mechanism to enable the updates that need reboots without disrupting running applications and causing them to lose state. Commodity systems also provide information about the current state of the system that can indicate if the system has an imminent failure on its hands. Subsystems, such as a hard disk’s Self-Monitoring Analysis Reporting Technology (S.M.A.R.T.) [46] let an autonomic service monitor the system’s hardware state. S.M.A.R.T. provides diagnostic information, such as temperature and read/write error rates, on the hard drives in the system that can indicate if the hard disk is nearing failure. Many commodity computer motherboards also have the ability to measure CPU and case temperature, as well as the speeds of the fans that regulate those temperatures. If temperature in the machine rises too high, hardware in the machine can fail catastrophically. Similarly, if the fans fail and stop spinning, the temperature will likely rise out of control. Our autonomic service monitors these sensors. If it detects an imminent failure, it will attempt to migrate a pod to a cooler system, and shut down the machine to prevent the hardware from being destroyed. Many administrators use an interruptible power supply to avoid data loss or corruption during a power loss. Although one can shut down a computer when the Chapter 4. 
AutoPod: Reducing Downtime for System Maintenance 51 battery backup runs low, most applications are not written to save their data in the presence of a forced shutdown. AutoPod, on the other hand, monitors UPS status. If the battery backup becomes low, it can quickly checkpoint the pod’s state to avoid any data loss when the computer is forced to shut down. Similarly, the operating system kernel on the machine monitors the state of the system, and if irregularities occur, such as DMA timeouts or resetting the IDE bus, it logs them. Our autonomic service monitors the kernel logs to discover these irregular conditions. When the hardware monitoring systems or the kernel logs provide information about possible pending system failures, the autonomic service checkpoints the pods running on the system and migrates them to a new system to be restarted. This ensures that state is not lost and informs administrators that maintenance is needed. Many policies can be implemented to determine to which system a pod should be migrated when a machine needs maintenance. Our autonomic service allows a pod to be migrated within a specified set of clustered machines. The autonomic service gets reports at regular intervals from the other machines’ autonomic services that report each machine’s load. If the autonomic service decides that it must migrate a pod, it chooses the machine in its cluster with the lightest load. 4.4 AutoPod Examples We give two brief examples to illustrate how AutoPod can be used to improve application availability for system services such as email delivery and desktop computing. In both cases we describe the architecture of the system and show how it can be run within AutoPod, enabling administrators to reduce downtime in the face of machine maintenance. We also discuss how a system administrator can set up and use AutoPod. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 4.4.1 52 System Services Administrators like to run many services on a single machine. By doing this, they are able to benefit from improved machine utilization, but this gives each service access to many resources not needed to perform their job. A classic example of this is email delivery. Email delivery services such as Exim and Sendmail are often run on the same system as other Internet services to improve resource utilization and simplify system administration through server consolidation. But these services, Sendmail in particular, have been exploited many times because they have access to system resources, such as a shell program, that they do not need to perform their job. For email delivery, AutoPod can isolate email delivery to provide a significantly higher level of security in light of the many attacks on mail transfer agents. Consider isolating an Exim service installation, the default Debian mail transfer agent. Using AutoPod, Exim can execute in a resource-restricted pod that isolates email delivery from other services on the system. Since AutoPod allows migrating a service between machines, the email delivery pod is migratable. If a fault is discovered in the underlying host machine, the email delivery service can be moved to another system while the original host is patched, keeping the email service available. With this email delivery example, a simple system configuration can prevent the common buffer overflow exploit of getting the privileged server to execute a local shell. 
By simply removing shells from within the Exim pod, we are limiting the amateur attacker’s ability to exploit flaws, while requiring very little additional knowledge about how to configure the service. AutoPod can further automatically monitor system status and checkpoint the Exim pod if a fault is detected to ensure that no data is lost or corrupted. Similarly, in the event that a machine has to be rebooted, the service can automatically be migrated to a new machine to avoid downtime. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 53 A common problem system administrators face is that forced machine downtime, e.g., for reboots, can make a service unavailable. A usual way to avoid this is to throw multiple machines at the problem. By providing the service through a cluster of machines, system administrators can upgrade the individual machines in a rolling manner. This enables system administrators to upgrade the systems while keeping the service available. But more machines increase management complexity and cost. AutoPod, in conjunction with hardware virtual machine monitors, improves this situation immensely. Using a virtual machine monitor to provide two virtual machines on a single host, AutoPod can then run a pod within a virtual machine to enable a single node maintenance scenario that decreases costs as well as management complexity. During regular operation, all application services run within the pod on one virtual machine. To upgrade the operating system on the running virtual machine, bring the second virtual machine online and migrate the pod to the new virtual machine. Once the initial virtual machine is upgraded and rebooted, migrate the pod back to it. Only one physical machine is needed, reducing costs. Only one virtual machine is in use for the majority of the time, reducing management complexity. Because AutoPod runs unmodified applications, any application service that can be installed can take advantage of its general single node maintenance. 4.4.2 Desktop Computing As personal computers have become ubiquitous in large corporate, government, and academic organizations, the cost of owning and maintaining them is growing unmanageable. These computers are increasingly networked, which only complicates matters. They must be constantly patched and upgraded to protect them and their data from the myriad of viruses and other attacks commonplace on today’s networks. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 54 To solve this problem, many organizations have turned to thin-client solutions such as Microsoft’s Windows Terminal Services and Sun’s Sun Ray. Thin clients allow administrators to centralize many of their administrative duties because only a single computer or cluster of computers needs to be maintained in a central location, while stateless client devices are used to access users’ desktop computing environments. Although thin-client solutions lower some administrative costs, this comes at the loss of semantics that users normally expect from a private desktop. For instance, users who use their own private desktop expect to be isolated from their coworkers. However, in a shared thin-client environment, users share the same machine. There may be many shared files, and a user’s computing behavior can impact the performance of other users on the system. 
Although a thin-client environment minimizes the number of machines, the centralized servers still need to be administered, and since they are more heavily utilized, management becomes more difficult. For instance, on a private system, one only has to schedule system maintenance with a single user. However, in a thin-client environment, one has to schedule maintenance with all the users on the system to avoid data loss. AutoPod enables system administrators to solve these problems by allowing each user to run a desktop session within a pod. Instead of users sharing a single file system, AutoPod provides each pod with three file systems: a shared read-only file system of all the regular system files users expect in their desktop environments, a private writeable file system for a user’s persistent data, and a private writeable file system for a user’s temporary data. By sharing common system files, AutoPod provides centralization benefits that simplify system administration. By providing private writeable file systems for each pod, AutoPod provides each user with privacy benefits similar to a private machine. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 55 Coupling pod virtualization and isolation mechanisms with a migration mechanism can provide scalable computing resources for the desktop and improve desktop availability. If a user needs access to more computing resources, for instance while doing complex mathematical computations, AutoPod can migrate that user’s session to a more powerful machine. If maintenance needs to be done on a host machine, AutoPod can migrate the desktop sessions to other machines without scheduling downtime and without forcibly terminating any programs users are running. 4.4.3 Setting Up and Using AutoPod To demonstrate how simple it is to set up a pod to run within the AutoPod environment, we provide a step-by-step walkthrough on how one would create a new pod that can run the Exim mail transfer agent. Setting up AutoPod to provide the Exim pod on Linux is straightforward and leverages the same skill set and experience system administrators already have on standard Linux systems. AutoPod is started by loading its kernel module into a Linux system and using its user-level utilities to set up and insert processes into a pod. Creating a pod’s file system is the same as creating a chroot environment. Administrators with experience creating a minimal environment containing only the application they want to isolate do not need to do any extra work. However, many administrators do not have experience creating such an environment and therefore need an easy way to create an environment in which to run their application. These administrators can take advantage of Debian’s debootstrap utility that allows a user to quickly set up an environment equivalent to a base Debian installation. An administrator would do a debootstrap stable /autopod to install the most recently released Debian system into the /autopod directory. While this also includes many Chapter 4. AutoPod: Reducing Downtime for System Maintenance 56 packages that are not required by the installation, it provides a small base to work from. To configure Exim, an administrator edits the appropriate configuration files within the /autopod/etc/exim4/ directory. To run Exim in a pod, an administrator does mount -o bind /autopod /autopod/exim/root to loopback-mount the pod directory onto the staging area directory, where the pod expects it to be. 
autopod add exim is used to create a new pod named exim which uses /autopod/exim/root as the root for its file system. Finally, autopod addproc exim /usr/sbin/exim4 is used to start Exim within the pod by executing the /usr/sbin/exim4 program, which is located at /autopod/exim/root/usr/sbin/exim4. AutoPod isolates the processes running within a pod from the rest of the system, which helps contain intrusions if they occur. But since a pod does not have to be maintained by itself, but can be maintained in the context of a larger system, one can also prune down the environment and remove many programs that an attacker could use against the system. For instance, if an Exim pod does not need to run shell scripts, there is no reason to leave programs such as /bin/bash, /bin/sh, and /bin/dash within the environment. But these programs will be necessary in the future if the administrator wants to upgrade the package using normal Debian methods. Because it is easy to recreate the environment, one approach would be to remove all the programs that are not wanted within the environment and recreate the environment when an upgrade is needed. Another would be to move those programs outside the pod, perhaps by creating a /autopod-backup directory. To upgrade the pod using normal Debian methods, the programs can be moved back into the pod’s file system. If an administrator wants to manually reboot the system without killing the processes within the Exim pod, they can first checkpoint the pod to disk by running autopod checkpoint exim -o /exim.ck, which tells AutoPod to checkpoint the Chapter 4. AutoPod: Reducing Downtime for System Maintenance 57 processes associated with the Exim pod to the file /exim.ck. The system can then be rebooted, potentially with an updated kernel. Once it comes back up, the pod can be restarted from the /exim.ck file by running autopod restart exim -i /exim.ck. These mechanisms are the same as those used by the AutoPod system status service for controlling the checkpointing and migration of pods. Standard Debian facilities can be used for running other services within a pod. Once the base environment is set up, an administrator can chroot into this environment to continue setup. By editing the /etc/apt/sources.list file appropriately and running apt-get update, an administrator will be able to install any Debian package into the pod. In the Exim example, Exim does not need to be installed since it is the default mail transfer agent (MTA) and is already included in the base Debian installation. If one wanted to install another MTA, such as Sendmail, one could run apt-get install sendmail, which will download Sendmail and all the packages needed to run it. This will work for any service available within Debian. An administrator can also use the dpkg --purge option to remove packages that are not required by a given pod. For instance, in running an Apache web server in a pod, one can remove the default Exim mail transfer agent because Apache does not need it. 4.5 Experimental Results We implemented AutoPod as a loadable kernel module in Linux, which requires no changes to the kernel, as well as a user space system status monitoring service. We present some experimental results using our Linux prototype to quantify the overhead of using AutoPod on various applications. Experiments were conducted on three IBM Netfinity 4500R machines, each with a 933Mhz Intel Pentium-III CPU, 512MB RAM, Chapter 4. 
9.1 GB SCSI HD, and 100 Mbps Ethernet connected to a 3Com Superstack II 3900 switch. One of the machines was used as an NFS server from which directories were mounted to construct the virtual file system for the pod on the other client systems. One client ran Debian Stable with a Linux 2.4.5 kernel, and the other ran Debian Unstable with a Linux 2.4.18 kernel.
To measure the cost of AutoPod migration and demonstrate the ability of AutoPod to migrate real applications, we migrated three application scenarios: an email delivery service using Exim and Procmail, a web content delivery service using Apache and MySQL, and a KDE desktop computing environment. Table 4.1 describes the configurations of the application scenarios we migrated and shows the time it takes to start up on a regular Linux system.

Name     Applications                                                    Normal Startup
Email    Exim 3.36                                                       504 ms
Web      Apache 1.3.26 and MySQL 4.0.14                                  2.1 s
Desktop  Xvnc – VNC 3.3.3r2 X Server                                     19 s
         KDE – Entire KDE 2.2.2 environment, including window manager,
               panel and assorted background daemons and utilities
         SSH – openssh 3.4p1 client inside a KDE Konsole terminal
               connected to a remote host
         Shell – The Bash 2.05a shell running in a Konsole terminal
         KGhostView – A PDF viewer with a 450KB 16-page PDF file loaded
         Konqueror – A modern standards-compliant web browser that is
               part of KDE
         KOffice – The KDE word processor and spreadsheet programs

Table 4.1 – Application Scenarios

To demonstrate our AutoPod prototype's ability to migrate across Linux kernels with different minor versions, we checkpointed each application workload on the 2.4.5 kernel client machine and restarted it on the 2.4.18 kernel machine. For these experiments, the workloads were checkpointed to and restarted from a local disk.
Table 4.2 shows the time to checkpoint and restart each application workload. Migration time also has to take into account network transfer time. As this is dependent on the transport medium, we include the uncompressed and compressed checkpoint image sizes.

Case     Checkpoint   Restart   Size     Compressed
Email    11 ms        14 ms     284 KB   84 KB
Web      308 ms       47 ms     5.3 MB   332 KB
Desktop  851 ms       942 ms    35 MB    8.8 MB

Table 4.2 – AutoPod Migration Costs

In all cases, checkpoint and restart times were significantly faster than the regular startup times listed in Table 4.1, taking less than a second for both operations, even when performed on separate machines or across a reboot. Moreover, a number of techniques have since been pioneered to further minimize downtime, including pre-copy/incremental checkpointing [43,81,141] and intelligent quiescing [81]. Pre-copy/incremental checkpointing minimizes the amount of time the services will be unavailable by taking partial checkpoints during the service's execution and only saving what has changed since the last checkpoint was taken. Intelligent quiescing minimizes the time checkpointing takes by keeping the services available until the entire service is ready to checkpoint.
We also show that the actual checkpoint images saved were modestly sized for complex workloads. For example, the Desktop pod had over 30 different processes running, including the KDE desktop applications, substantial underlying window system infrastructure, inter-application sharing, and a rich desktop interface managed by a window manager.
Even with all these applications running, they checkpoint to a very reasonable 35 MB uncompressed for a full desktop environment. Additionally, if checkpoint images must be transferred over a slow link, Table 4.2 shows that they can be compressed very well with bzip2. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 4.6 60 Related Work Virtual machine monitors (VMMs) have been used to provide secure isolation [28, 142, 147] and to migrate an entire operating system environment [128]. Unlike AutoPod, standard VMMs decouple processes from the underlying machine hardware, but tie them to an instance of an operating system. As a result, VMMs cannot migrate processes apart from that operating system instance and cannot continue running those processes if the operating system instance goes down, such as during security upgrades. In contrast, AutoPod decouples process execution from the underlying operating system, allowing it to migrate processes to another system when an operating system instance is upgraded. VMMs have been proposed to support online maintenance of systems [87] by having a microvisor that supports at most two virtual machines running on the machine at the same time, effectively giving each physical machine the ability to act as its own hot spare. This proposal, however, explicitly depends on AutoPod’s heterogeneous migration without providing this functionality itself. Many systems have been proposed to support process migration [22, 24, 40, 42, 54, 85, 95, 106, 119, 120, 125, 129], but they do not allow migration across independent machines running different operating system versions. TUI [131] provides support for process migration across machines running different operating systems and hardware architectures. Unlike AutoPod, TUI has to compile applications on each platform using a special compiler and does not work with unmodified legacy applications. AutoPod builds on Zap [100] to support transparent migration across systems running the same kernel version. AutoPod goes beyond Zap in providing transparent migration across minor kernel versions, which is essential for making applications available during operating system security upgrades. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 61 Replication in clustered systems can provide the ability to do rolling upgrades. By leveraging many nodes, individual nodes can be taken down for maintenance without significantly impacting the load that the cluster can handle. For example, web content is commonly delivered by multiple web servers behind a front end manager. This front end manager enables an administrator to bring down back end web servers for maintenance by directing requests only to the active web servers. This simple solution is effective because it is easy to replicate web servers to serve the same content. Although this model works fine for web server loads, as the individual jobs are very short, it does not work for long-running jobs, such as a user’s desktop. In the web server case, replication and upgrades are easy to do because only one web server is used to serve any individual request and any web server can be used to serve any request. For long-running stateful applications, such as a user’s desktop, requests cannot be arbitrarily redirected to any desktop computing environment because each user’s desktop session is unique. 
Although specialized hardware support could be used to keep replicas synchronized by having all of them process all operations, this is prohibitively expensive for most workloads and does not address how to resynchronize the replicas in the presence of rolling upgrades. Another possible solution is allowing the kernel to be hot-pluggable. Although micro-kernels are not prevalent, they are able to upgrade their parts on the fly. More commonly, many modern monolithic kernels have kernel modules that can be inserted and removed dynamically. This can allow upgrading parts of a monolithic kernel without requiring reboots. The Nooks [136] system extends this concept by enabling kernel drivers and other kernel functionality, such as file systems, to be isolated in their own domains to help isolate faults in kernel code and provide a more reliable system. However, in all of these cases, there is still a base kernel on the machine that cannot be replaced without a reboot. If that part must be replaced, all data is lost. Chapter 4. AutoPod: Reducing Downtime for System Maintenance 62 The K42 operating system can be dynamically updated [29], enabling software patches to be applied to a running kernel even in the presence of data structure changes. But it requires a completely new operating system design and does not work with any commodity operating system. Even on K42, it is not yet possible to upgrade the kernel while running realistic application workloads. Chapter 5 PeaPod: Isolating Cooperating Processes A key problem faced by today’s computers is that they are difficult to secure due to the numerous complex services they run. If a single service is exploited, an attacker is able to access all the resources available to the machine it is running on. To prevent this from occurring, it is important to design systems with security principles [126] in mind to limit the damage that can occur when security is breached. One of the most important principles is ensuring that one operates in a Least-Privilege environment. Least-Privilege environments require that a user or a program has access only to the resources that are required to complete their job. Even if the user’s or service’s environment is exploited, the attacker will be constrained. For a system with many distinct users and uses, designing a least-privilege system can prove to be very difficult, as many independent application systems can be used in many different and unknown ways. A common approach to providing least-privilege environments is to separate each individual service into its own sandbox container environment, such as provided by Chapter 5. PeaPod: Isolating Cooperating Processes 64 AutoPod. Many sandbox container environments have been developed to isolate untrusted applications [7, 60, 65, 74, 86, 118, 144]. However, many of these approaches have suffered from being too complex and too difficult to configure to use in practice, and have often been limited by an inability to work seamlessly with existing system tools and applications. Virtual machine monitors (VMMs) offer a more attractive approach by providing a much easier-to-use isolation model of virtual machines, which look like separate and independent systems apart from the underlying host system. However, because VMMs need to run an entire operating system instance in each virtual machine, the granularity of isolation is very coarse, enabling malicious code in a virtual machine to use the entire set of operating system resources. 
Multiple operating instances also need to be maintained, adding administrative overhead. A primary problem with a sandbox container that attempts to isolate a single service is that many services are composed of many interdependent and cooperating programs. Each individual application that makes up the service has its own set of access requirements. However, since they all run within the same sandbox container, each individual application ends up with access to the superset of resources that are needed by all the programs that make up the service, thereby negating the least-privilege principle. One cannot divide the programs into distinct sandbox container environments since many programs are interdependent and expect to work from within a single context. We leveraged operating system virtualization to design and build PeaPod to enable the ability to sandbox complete services, while also enabling its interdependent and cooperating components to be restricted into least-privilege environments. PeaPod combines two key virtualization abstractions in its virtualization layer. First, it leverages the secure pod abstraction to provide a sandbox container for entire services to run within. Second, it introduces the pea (Protection and Encapsulation Abstrac- Chapter 5. PeaPod: Isolating Cooperating Processes 65 tion). A pea is an easy-to-use least-privilege mechanism that enables further isolation among application components that need to share limited system resources within a single pod. It can prevent compromised application components from attacking other components within the same pod. A pea provides a simple resource-based model that restricts access to other processes, IPC, file system and network resources available to the pod as a whole. PeaPod improves upon previous approaches by not requiring any operating system modifications, as well as avoiding the time of check, time of use (TOCTOU) race conditions that affect many of them [145]. For instance, unlike other approaches that perform file system security checks at the system call level and therefore do not check the actual file system object that the operating system uses, PeaPod leverages file system virtualization to integrate directly into the kernel’s file system security framework. PeaPod is designed to avoid the time of check, time of use race conditions that affect previous approaches by performing all file system security checks within the regular file system security paths and on the same file system objects that the kernel itself uses. 5.1 PeaPod Model The PeaPod model combines the previously introduced operating system virtualization secure pod abstraction with a new abstraction called peas. The secure pod abstraction, as shown in AutoPod, is useful for separating distinct application services into separate machine environments. Peas are used in a pod to provide finegrained isolation among application components that may need to interact within a single machine environment, such as using interprocess communication mechanisms, including signals, shared memory, IPC messages and semaphores, and process forking Chapter 5. PeaPod: Isolating Cooperating Processes 66 Figure 5.1 – PeaPod Model and execution. Figure 5.1 shows how pods and peas work together. Each pod, and the resources contained with it, is fully independent from each other pods, but each pod can each have an arbitrary number of peas associated with them to apply extra security restrictions amongst their cooperating processes. 
A pea is an abstraction that can contain a group of processes, restrict those processes in interacting with processes outside of the pea, and limit their access to only a subset of system resources. Unlike the secure pod abstraction, which achieves isolation by controlling what resources are located within the namespace, a pea achieves isolation levels by controlling what system resources within a namespace its processes are allowed to access and interact with. For example, a process in a pea can see file system resources and processes available to other peas within a single pod, but can be restricted from accessing them. Unlike processes in separate pods, processes in separate peas in a single pod share the same namespace and can be allowed to inter- Chapter 5. PeaPod: Isolating Cooperating Processes 67 act using traditional interprocess communication mechanisms. Processes can also be allowed to move between peas in the same pod. However, by default, a processes in a pea cannot access any resource that is not made available to that pea, be it a process pid, IPC key or file system entry. Peas can support a wide range of resource restriction policies. By default, processes contained in a pea can only interact with other processes in the same pea. They have no access to other resources, such as file system and network resources or processes outside of the pea. This provides a set of fail safe defaults, as any extra access has to be explicitly allowed by the administrator. The pea abstraction allows for processes running on the same system to have varying levels of isolation by running in separate peas. Many peas can be used side by side to provide flexibility in implementing a least-privilege system for programs that are composed of multiple components that must work together, but do not all need the same level of privilege. One usage scenario would be to have a severely resource limited pea in which a privileged process executes. The process is, howerver, allowed to use traditional Unix semantics to work with less privileged programs that are in less resource restricted peas. For example, peas can be used to allow a web server appliance the ability to serve dynamic content via CGI in a more secure manner. Since the web server and the CGI scripts need separate levels of privilege, and have different resource requirements, they should not have to run within the same security context. By configuring two separate peas for a web service, one for the web server to run within, and a separate one for the specific CGI programs it wants to execute, one limits the damage that can occur if a fault is discovered within the web server. If one manages to execute malicious code within the context of the web server, one can only use resources that are allocated to the web server’s pea, as well as only execute the specific programs that are needed Chapter 5. PeaPod: Isolating Cooperating Processes 68 as CGIs. Since the CGI programs will also only run within their specific security context, the ability for malicious code to do harm is severely limited. Peas and pods together provide secure isolation based on flexible resource restriction for programs as opposed to restricting access based on users. Peas and pods also do not subvert underlying system restrictions based on user permissions, but instead complement such models by offering additional resource control based on the environment in which a program is executed. 
Instead of allowing programs with root privileges to do anything they want to a system, PeaPod allows a system to control the execution of such programs to limit their ability to harm a system even if they are compromised. 5.2 PeaPod Virtualization To support the PeaPod virtualization abstraction design of secure and isolated namespaces on commodity operating systems, we leveraged the secure pod virtualization architecture described in Chapter 3.1.1. For example, if one had a web server that just serves static content, one can easily set up a web server pod to contain only the files the web server needs to run and the content it wants to serve. The web server pod could have its own IP address, decoupling its network presence from the underlying system. It could also limit network access to client-initiated connections. If the web server application gets compromised, the pod limits the ability of an attacker to further harm the system since the only resources the attacker has access to are the ones explicitly needed by the service. Furthermore, there is no need to carefully disable other network services commonly enabled by the operating system that might be compromised, as only the single service is running within the pod. Chapter 5. PeaPod: Isolating Cooperating Processes 5.2.1 69 Pea Virtualization Peas are supported using virtualization mechanisms that label resources and enforce a simple set of configurable permission rules to impose levels of isolation among processes running within a single pod. For example, when a process is created via the fork() and clone() system calls, its process identifier is tagged with the identifier of the pea in which it was created. Peas leverage the pod’s shadow pod process identifier and also place it in the same pea as its parent process. A process’s ability to access pod resources is then dictated by the set of access permissions rules associated with its pea. Like pod virtualization, the key pea operating system virtualization mechanisms are system call interposition and file system stacking. Pea virtualization employs system call interposition to virtualize the kernel and wrap existing system calls. Kernel virtualization enables peas to enforce restrictions on process interactions by controlling access to process and IPC virtual identifiers. Since each resource is labeled with the pea in which it was created, the kernel virtualization mechanism checks if the pea labels of the calling process and the resource to be accessed are the same. When a process in one pea tries to send a signal to a process in a separate pea by using the kill system call, the system returns an error value of EPERM, as the process exists, but this process has no permission to signal it. However, a parent process is able to use the wait system call to clean up a terminated child process’s state, even if that child process is running within a separate pea, since wait does not modify a process by affecting its execution. This is analogous to a regular user being able to list the metadata of a file, such as owner and permission bits, even if the user has no permission to read from or write to the file. When a new process is created, it executes in the pea security domain of its parent. However, when the process executes a new program, the security domain of Chapter 5. PeaPod: Isolating Cooperating Processes 70 the parent might not be the appropriate security domain to execute the new program in. 
Therefore, one wants the ability to explicitly transition the process from one pea security domain to another on new program execution. To support this, peas provide a single type of pea transition rule that lets a pea determine how a process can transition from its current pea to another. This transition rule is specified by a program filename and pea identifier. A pea is able to have multiple pea access transition rules of this type. The rule specifies that a process should be moved into the pea specified by the pea identifier if it executes the program specified by the given filename. This is useful when it is desirable to have that new program execution occur in an environment with different resource restrictions. For example, an Apache web server running in a pea may want to execute its CGI child processes in a pea with different restrictions. Pea transitioning is supported by interposing on the exec system call and transitioning peas if the process to be executed matches a pea access transition rule for the current pea. Note that pea access transition rules are one-way transitions that do not allow a process to return to its previous pea unless its new pea explicitly provides for such a transition. Kernel virtualization is used to control network access inside the pea. Peas provide two networking access rule types. One allow processes in the pea to make outgoing network connections on a pod’s virtual network adapters, while the other allows processes in the pea to bind to specific ports on the adapter to receive incoming connections. Pea network access rules can allow complete access to a pod network adapter, or only allow access on a per-port basis. Since any network access occurs through system calls, peas simply check the options of the networking system call, such as bind and connect, to ensure that it is allowed to perform the specified action. Pea virtualization employs a set of file system access rules and file system virtualization to provide each pea with its own permission set on top of the pod file Chapter 5. PeaPod: Isolating Cooperating Processes 71 system. To provide a least-privilege environment, processes should not have access to file system privileges they do not need. For example, while Sendmail has to write to /var/spool/mqueue, it only has to read its configuration from /etc/mail and should not need to have write permission on its configuration. To implement such a least-privilege environment, peas allow files to be tagged with additional permissions that overlay the respective underlying file permissions. File system permissions determine access rights based on the user identity of the process while pea file permission rules determine access rights based on the pea context in which a process is executed. Each pea file permission rule can selectively allow or deny the use of the underlying read, write and execute permissions of a file on a per-pea basis. The underlying file permission is always enforced, but pea permissions can further restrict whether the underlying permission is allowed to take effect. The final permission is achieved by performing a bitwise and operation on both the pea and file system permissions. For example, if the pea permission rule allowed for read and execute, the permission set of r-x would be triplicated to r-xr-xr-x for the three sets of Unix permissions and the bitwise and operation would mask out any write permission that the underlying file system allows. This prevents any process in the pea from opening the file to modify it. 
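To make this masking step concrete, the following C sketch shows one way the combination could be computed: the pea's rwx bits are replicated across the owner, group, and other permission sets and then ANDed with the file's mode bits. The PEA_* constants and the pea_mask_mode function are illustrative names for this sketch, not PeaPod's actual code.

#include <sys/stat.h>

/* Illustrative pea permission bits (read, write, execute). */
#define PEA_READ  04
#define PEA_WRITE 02
#define PEA_EXEC  01

/* Replicate the pea's rwx bits across the user, group and other
 * permission sets and AND them with the file's mode, so a pea can
 * only remove rights the file system already grants, never add any. */
static mode_t pea_mask_mode(mode_t file_mode, unsigned int pea_bits)
{
    mode_t mask = (mode_t)((pea_bits << 6) | (pea_bits << 3) | pea_bits);

    /* keep non-permission bits, mask only the rwx portion */
    return (file_mode & ~(mode_t)0777) | (file_mode & 0777 & mask);
}

For example, a pea rule allowing only read and execute (r-x) reduces a file whose underlying mode is 0644 to an effective 0444 within that pea, so no process in the pea can open it for writing.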
Enforcing on-disk labeling of every single file, such as supported through access control lists provided by many modern file systems, is inflexible if a single underlying file system is going to be used for multiple disparate pods and peas. As each pea in each pod can use the same files with different permission schemes, storing the pea's permission data on disk is not feasible. Instead, peas support the ability to dynamically label each file within a pod's file system based on two simple path-matching permission rules: path-specific permission rules and directory-default permission rules. A path-specific permission matches an exact path on the file system. For instance, if there is a path-specific permission for /home/user/file, only that file will be matched with the appropriate permission set. On the other hand, if there is a directory-default permission for the directory /home/user/, then any file under that directory in the directory tree can match it and inherit its permission set.
Given a set of path-specific and directory-default permissions for a pea, the algorithm for determining which permission applies to a given path starts with the complete path and walks up the path to the root directory until it finds a matching permission rule. The algorithm can be described in four simple steps:

1. If the specific path has a path-specific permission, return that permission set.
2. Otherwise, choose the path's directory as the current directory to test.
3. If the directory being tested has a directory-default permission, return that permission set.
4. Otherwise, set its parent as the current directory to test and go back to step 3.

If there is no path-specific permission, the closest directory-default permission to the specified path becomes the permission set for that path. By default, peas give the root directory "/" a directory-default permission denying all permissions; thus, the default for every file on the system, unless otherwise specified, is deny. This ensures that peas have a fail-safe default setup and do not allow access to any files unless specified by the administrator.
The semantics of pea file permissions are based on file path names. If a file has more than one path name, such as via a hard link, both have to be protected by the same permission; otherwise, depending on the order in which the file is accessed, the permission set it gets will be determined simply by the path name that was accessed first. This issue only occurs when creating the initial set of pea file access permissions. Once the pea is set up, any hard links that are created will obey the regular file system permissions. For instance, one is not allowed to create a hard link to a file that one does not have permission to access. On the other hand, if one does have permission to access the file, a path-specific permission rule will be created for the newly created file that corresponds to the permission of the path name it was linked to.
The pea architecture uses file system virtualization to integrate the pea file system namespace restrictions into the regular kernel permission model, thereby avoiding TOCTOU race conditions. It accomplishes this by virtualizing the file system's ->lookup method, which fills in the respective file's inode structure, and the ->permission method, which uses the stored permission data to make simple permission determinations.
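The four-step walk above can be sketched in a few lines of user-space C. The struct pea_rule layout and the pea_find_rule name are assumptions made for illustration rather than PeaPod's actual in-kernel data structures, and rule paths are assumed to be stored without trailing slashes.

#include <string.h>
#include <stdbool.h>
#include <stddef.h>

struct pea_rule {
    const char  *path;        /* e.g. "/home/user" or "/home/user/file"          */
    bool         dir_default; /* true for directory-default, false for specific  */
    unsigned int bits;        /* allowed rwx bits                                */
};

/* Find the rule governing 'path': first an exact path-specific match,
 * then the closest directory-default match walking up toward "/".     */
static const struct pea_rule *
pea_find_rule(const struct pea_rule *rules, size_t n, const char *path)
{
    char buf[4096];
    size_t i;

    /* Step 1: exact path-specific match. */
    for (i = 0; i < n; i++)
        if (!rules[i].dir_default && strcmp(rules[i].path, path) == 0)
            return &rules[i];

    /* Steps 2-4: walk up the directory tree looking for a directory default. */
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (;;) {
        char *slash = strrchr(buf, '/');
        if (slash == NULL)
            break;
        if (slash == buf)            /* reached the root directory */
            buf[1] = '\0';
        else
            *slash = '\0';
        for (i = 0; i < n; i++)
            if (rules[i].dir_default && strcmp(rules[i].path, buf) == 0)
                return &rules[i];
        if (buf[1] == '\0')          /* already tested "/": stop   */
            break;
    }
    return NULL;  /* with the default deny rule on "/", this means deny */
}

In the actual system this decision is reached inside the stacked file system rather than at the system call boundary, so it is applied to the same file system objects the kernel itself uses.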
A file system's ->permission method is a standard part of the operating system's security infrastructure, so no kernel changes are necessary.

5.2.2 Pea Configuration Rules

5.2.2.1 File System

Many system resources in Unix, including normal files, directories, and system devices, are accessed via files, so controlling access to the file system is crucial. Each pea must be restricted to those files used by its component processes. This control is important for security, because processes that work together do not necessarily need the same access rights to files. All file system access is controlled by path-specific and directory-default rules, which specify a file or directory and an access right.
The access right values for file rules are read, write, and execute, similar to standard Unix permissions. For convenience, we also define allow and deny, which are aliases for all three of read, write and execute, and cannot be combined with other access values in the same rules. When a path-specific or directory-default rule gives access to a directory entry, it implicitly gives execute, but not read or write, access to all parent directories of the file, up to the root directory. On the other hand, if a separate path-specific rule denies access to a directory, then access to both the directory and its contents will be denied. This occurs even if a separate directory-default rule would give access to subdirectories or files, as the path-specific rule is a better match.
Consider the case of the Sendmail mail daemon and the newaliases command with regard to the system-wide aliases file. Sendmail runs as the root user and needs to be able to read the aliases file in order to know where it should forward mail or otherwise redirect it. newaliases is a symbolic link to sendmail and typically also runs as the root user in order to update the aliases file and convert it into the database format used by the Sendmail daemon. In our example, newaliases runs in its own pea and is able to read from /etc/mail/aliases and read from and write to /etc/mail/aliases.db. Meanwhile, sendmail runs in another pea and is able to read both files, but not write to them. We use two path-specific rules per pea to express these access rules, as described in Figure 5.2.

pod mailserver {
    pea sendmail {
        path /etc/mail/aliases       read
        path /etc/mail/aliases.db    read
    }
    pea newaliases {
        path /etc/mail/aliases       read
        path /etc/mail/aliases.db    read,write
    }
}

Figure 5.2 – Example of Read/Write Rules

Similar rules can protect a device like /dev/dsp. When a user logs into a system locally, via the console, they are typically given control of local devices, such as the physical display and the sound card. Any application that the user runs has access to read from and write to these local devices, even though this privilege is not necessary. For example, we want to restrict playing and recording of sound files to the play and rec applications, which are part of SoX [9]. Figure 5.3 describes the rules that provide the appropriate access to the device.

pod music {
    pea play {
        path /dev/dsp    write
    }
    pea rec {
        path /dev/dsp    read
    }
}

Figure 5.3 – Protecting a Device

The other file system rule is the directory-default rule. It uses the same access values as path-specific rules, but it is used to specify the default access for files below a directory.
Any file or sub-directory will inherit the same access flags, since access is determined by matching the longest possible path prefix. Unlike path-specific rules, directory-default rules can deny access to a directory in general while still allowing access to specific files. Figure 5.4 describes a pea that denies access to all files in /bin, while only allowing access to /bin/ls.

pod fileLister {
    pea onlyLs {
        dir-default /bin    deny
        path /bin/ls        allow
    }
}

Figure 5.4 – Directory-Default Rule

5.2.2.2 Transition Rules

When Sendmail and Procmail are used together to deliver mail to local users, the sendmail process creates a new process and executes the procmail program to deliver the mail to the user's spool. Procmail needs different security settings, so it must transition from a Sendmail pea to a Procmail pea. Rules must be defined that state to which pea a process will transition upon execution. When a process calls the execve system call, we examine the file name to be executed and perform a longest prefix match on all the transition rules. For instance, by specifying a directory for a transition, PeaPod will cause a pea transition to occur for any program executed that is located in that directory, unless there is a more specific transition rule available. Figure 5.5 creates a pea for Sendmail and one for Procmail, and specifies that a process should transition when the procmail program is executed.

pod mailserver {
    pea sendmail {
        transition /usr/bin/procmail procmail
    }
    pea procmail {
    }
}

Figure 5.5 – Transition Rules

PeaPod does not provide the ability for a process to transition to another pea except by executing a new program. If it could, a process could open an allowed file in one pea and then transition to another pea where access to that file was not allowed, and thus circumvent the security restrictions.

5.2.2.3 Networking Rules

PeaPod provides two rules that define the network capabilities a pea exposes to the processes running within it. First, peas are able to restrict a process from instantiating an outgoing connection. Second, peas are able to limit what ports a process can bind to and listen on for incoming connections. By default, peas do not let processes make any outgoing connections or bind to any port. Whereas a full network firewall is an important part of any security architecture, it is orthogonal to the goals of PeaPod and therefore belongs in its own security layer.
Continuing the simplified Sendmail/Procmail usage case, an administrator would want to easily confine the network presence of processes running within the Sendmail and Procmail peas, as shown in Figure 5.6. By allowing sendmail to make outgoing connections, to enable it to send messages, as well as to bind to port 25, the standard port for receiving messages, Sendmail can continue to work normally. However, processes running within the procmail pea, which is otherwise less restricted, are not allowed to bind to any port, while they are allowed to initiate outgoing network connections. This allows programs, such as spam filters that require checking network-based information, to continue to work.

pod mailserver {
    pea sendmail {
        outgoing allow
        bind tcp/25
    }
    pea procmail {
        outgoing allow
    }
}

Figure 5.6 – Networking Rules
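Since the networking rules are checked by examining the arguments of networking system calls such as bind and connect, the enforcement itself is small. The following C sketch illustrates what such a check might look like for bind; the pea_net_rules structure and the function names are hypothetical, the sketch handles IPv4 only, and it is not PeaPod's actual interposition code.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

struct pea_net_rules {
    bool            allow_outgoing;  /* set by an "outgoing allow" rule     */
    const uint16_t *bind_ports;      /* ports listed in "bind" rules        */
    size_t          n_bind_ports;    /* zero means binding is never allowed */
};

/* Decide whether a bind() to addr should be permitted for a process
 * running in a pea governed by these rules (fail-safe default: deny). */
static bool pea_bind_allowed(const struct pea_net_rules *r,
                             const struct sockaddr *addr)
{
    const struct sockaddr_in *in;
    uint16_t port;
    size_t i;

    if (addr->sa_family != AF_INET)
        return false;                /* sketch: IPv4 only */

    in = (const struct sockaddr_in *)addr;
    port = ntohs(in->sin_port);

    for (i = 0; i < r->n_bind_ports; i++)
        if (r->bind_ports[i] == port)
            return true;
    return false;
}

/* The outgoing rule is all-or-nothing for this sketch. */
static bool pea_connect_allowed(const struct pea_net_rules *r)
{
    return r->allow_outgoing;
}

The corresponding connect check is simpler still: if the pea has no outgoing allow rule, the call is denied regardless of destination.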
5.2.2.4 Shared Namespace Rules

PeaPod provides a single namespace rule for allowing processes to access the pod's virtual private identifiers that do not belong to their own pea. PeaPod enables peas to be configured to only have access to resources tagged with specific pea identifiers or with the special global pea identifier that enables access to every virtual private resource in the pod. This rule is used to create a global pea with access to all the resources of a pod, for instance to allow a process to start up and shut down services running within a resource-restricted pea. Figure 5.7 describes a pod that has a pea, global_access, that is able to access every resource in the pod, as well as a pea, test1, that is able to access the resources created within one of its sibling peas, test2.

pod service {
    pea global_access {
        namespace global
    }
    pea test1 {
        namespace test2
    }
    pea test2 {
    }
}

Figure 5.7 – Namespace Access Rules

5.2.2.5 Managing Rules

To make it simpler for administrators to create peas in a pod, we allow groups of rules to be saved to a file and included in the main configuration file for a given PeaPod configuration. These groups of rules would typically describe the minimum resources necessary for a single application. Application packagers can include rule group files in their package and administrators can share rule groups with each other.
A rule group, seen in Figure 5.8 for a compiler, would be stored in a central location. An administrator uses an include rule to reference the external file as part of a development PeaPod. Figure 5.9 contains the tools necessary to build a Linux kernel from source; it permits access to the source code itself and a writable directory for the binaries.

path /usr/bin/gcc               read,execute
dir-default /usr/lib/gcc-lib    read,execute
path /usr/bin/cpp               read,execute
path /usr/lib/libiberty.a       read
path /usr/bin/ar                read,execute
path /usr/bin/as                read,execute
path /usr/bin/ld                read,execute
path /usr/bin/ranlib            read,execute
path /usr/bin/strip             read,execute

Figure 5.8 – Compiler Rules

pod workstation {
    pea kernel-development {
        include "stdlibs"
        include "compiler"
        include "tar"
        include "bzip2"
        dir-default /usr/local/src/      read
        dir-default /scratch/binaries    allow
    }
}

Figure 5.9 – Set of Multiple Rule Files

These management rules demonstrate PeaPod's ability to isolate the specific resource needs of individual programs from the local policy an administrator defines. The knowledge needed to build a set of rules that provides the specific set of resources a program service needs to execute is not always readily available to users of security systems. However, this knowledge is available to the authors and distributors of the system. PeaPod's management rules allow the creation and distribution of rule files that define the specific set of resources needed to execute a program service, while enabling the local administrator to further define the resource-restriction policy.

5.3 Security Analysis

Saltzer and Schroeder [126] describe several principles for designing and building secure systems. These include:

• Economy of mechanism: Simpler and smaller systems are easier to understand and ensure that they do not allow unwanted access.

• Fail safe defaults: Systems must choose when to allow access as opposed to choosing when to deny.
• Complete mediation: Systems should check every access to protected objects. • Least-privilege: A process should only have access to the privileges and resources it needs to do its job. • Psychological acceptability: If users are not willing to accept the requirements that the security system imposes, such as very complex passwords that the users are forced to write down, security is impaired. Similarly, if using the system is too complicated, users will misconfigure it and end up leaving it wide open. • Work factor : Security designs should force an attacker to have to do extra work to break the system. The classic quantifiable example is when one adds a single bit to an encryption key, one doubles the key space an attacker has to search. PeaPod is designed to satisfy these six principles. PeaPod provides economy of mechanism using a thin virtualization layer, based on system call interposition for Chapter 5. PeaPod: Isolating Cooperating Processes 81 kernel virtualization and file system stacking for file system virtualization, that only adds a modest amount of code to a running system. The largest part of the system is due to the use of a null stackable file system with 7000 lines of C code, but this file system was generated using a simple high-level file system language [152], and only 50 lines of code were added to this well-tested file system to implement PeaPod’s file system security. Furthermore, PeaPod changes neither applications nor the underlying operating system kernel. The modest amount of code to implement PeaPod makes the system easier to understand. As the PeaPod security model provides only resources that are explicitly stated, it is relatively easy to understand the security properties of resource access provided by the model. PeaPod provides fail safe defaults by only allowing access to resources that have been explicitly given to peas and pods. If a resource is not created within a pea, or explicitly made available to that pea, no process within that pea will be allowed to access it. Whereas a pea can be configured to enable access to all resources of the pod, this is an explicit action an administrator has to take. PeaPod provides for complete mediation of all resources available on the host machine by ensuring that all resource access occur through the pod’s virtual namespace. Unless a file, process or other operating system resource was explicitly placed in the pod by the administrator or created within the pod, PeaPod’s virtualization will not allow a process within a pod to access the resource. PeaPod provides a least-privilege environment by enabling an administrator to only include the data necessary for each service. PeaPod can provide separate pods for individual services so that separate services are isolated and restricted to the appropriate set of resources. Even if a service is exploited, PeaPod will limit the attacker to the resources the administrator provided for that service. While one can achieve similar isolation by running each individual service on a separate machine, Chapter 5. PeaPod: Isolating Cooperating Processes 82 this leads to inefficient use of resources. PeaPod maintains the same least-privilege semantic of running individual services on separate machines, while making efficient use of machine resources at hand. For instance, an administrator could run MySQL and Sendmail mail transfer services on a single machine, but within different pods. 
If the Sendmail pod gets exploited, the pod model ensures that the MySQL pod and its data will remain isolated from the attacker. Furthermore, PeaPod's peas are explicitly designed to enable least-privilege environments by restricting programs to an environment that can be easily limited to provide the least amount of access the encapsulated program needs to do its job.
PeaPod provides psychological acceptability by leveraging the knowledge and skills system administrators already use to set up system environments. Because pods provide a virtual machine model, administrators can use their existing knowledge and skills to run their services within pods. Furthermore, peas use a simple resource-based model that does not require a detailed understanding of any underlying operating system specifics. This differs from other least-privilege architectures that force an administrator to learn new principles or complicated configuration languages that require a detailed understanding of operating system principles.
As with least-privilege, PeaPod increases the work factor it would take to compromise a system simply by not making available the resources that attackers depend on to harm a system once they have broken in. For example, because PeaPod can provide selective access to which programs are included within a pea's view, it would be very difficult to get a root shell on a system that does not have access to any shell program. While removing the shell does not create a complete least-privilege system, it is a simple change that creates a lesser-privilege system and therefore increases the work factor that would be required to compromise the system.

5.4 Usage Examples

We briefly describe three examples that help illustrate how the PeaPod virtualization layer can be used to improve computer security and application availability for different application scenarios. The application scenarios are email delivery, web content delivery, and desktop computing. In the following examples we make extensive use of PeaPod's ability to compose rule files in order to simplify the rules. Instead of listing every file and library necessary to execute a program, we isolate them into a separate rule file to keep the focus on the actual management of the service that the pea is trying to protect.

5.4.1 Email Delivery

For email delivery, PeaPod's virtualization layer can isolate different components of email delivery to provide a significantly higher level of security in light of the many attacks on Sendmail vulnerabilities that have occurred [15,16,83,88]. Consider isolating a Sendmail installation that also provides mail delivery and filtering via Procmail. Email delivery services are often run on the same system as other Internet services to improve resource utilization and simplify system administration through server consolidation. However, this can provide additional resources to services that do not need them, potentially increasing the damage that can be done to the system if it is attacked.
As shown in Figure 5.10, using PeaPod's virtualization layer, both Sendmail and Procmail can execute in the same pod, which isolates email delivery from other services on the system. Furthermore, Sendmail and Procmail can be placed in separate peas, which allows the necessary interprocess communication mechanisms between them while improving isolation.
This pod is a common example of a privileged service that has child helper applications.

pod mail-delivery {
    pea sendmail {
        include "stdlibs"
        include "sendmail"
        dir-default /etc                 read
        dir-default /var/spool/mqueue    allow
        dir-default /var/spool/mail      allow
        dir-default /var/run             allow
        path /usr/bin/procmail           read,execute
        transition /usr/bin/procmail     procmail
        bind tcp/25
        outgoing allow
    }
    pea procmail {
        dir-default /    allow
        outgoing allow
    }
}

Figure 5.10 – Email Delivery Configuration

In this case, the Sendmail pea is configured with full network access to receive email, but only with access to the files necessary to read its configuration and to send and deliver email. Sendmail would be denied write access to file system areas such as /usr/bin to prevent modification of those executables, and would only be allowed to transition a process to the Procmail pea if it is executing Procmail, the only new program its pea allows it to execute. On mail delivery, Sendmail would then exec Procmail, which transitions the process into the Procmail pea. The Procmail pea is configured with a more liberal access permission, namely allowing access to the pod's entire file system, enabling it to run other programs, such as SpamAssassin. Although an administrator could configure programs Procmail executes, such as SpamAssassin, to run within their own peas, this example keeps them all within a single pea to demonstrate a simple configuration. As a result, the Sendmail/Procmail pod can provide full email delivery service while isolating Sendmail such that even if Sendmail is compromised by an attack, such as a buffer overflow, the attacker would be contained in the Sendmail pea and would not even be able to execute programs, such as a root shell, to further compromise the system.

5.4.2 Web Content Delivery

For web content delivery, PeaPod's virtualization layer can isolate different components of web content delivery to provide a significantly higher level of security in light of common web server attacks that may exploit CGI script vulnerabilities. Consider isolating an Apache web server front end, a MySQL database back-end, and CGI scripts that interface between them. Although one could run Apache and MySQL in separate pods, because they are providing a single service, it makes sense to run them within a single pod that is isolated from the rest of the system. However, because both Apache and MySQL are within the pod's single namespace, if an exploit is discovered in Apache, it could be used to perform unauthorized modifications to the MySQL database.
To provide greater isolation among different web content delivery components, Figure 5.11 describes a set of three peas in a pod: one for Apache, a second for MySQL, and a third for the CGI programs. Each pea is configured to contain the minimal set of resources needed by the processes running within the respective pea. The Apache pea includes the apache binary, configuration files and the static HTML content, as well as a transition permission to execute all CGI programs into the CGI pea. The CGI pea contains the relevant CGI programs as well as access to the MySQL daemon's named socket, allowing interprocess communication with the MySQL daemon to perform the relevant SQL queries. The MySQL pea contains the mysql daemon binary, configuration files and the files that make up the relevant databases.
pod web-delivery {
    pea apache {
        include "stdlibs"
        path /usr/sbin/apache          read,execute
        path /usr/sbin/apachectl       read,execute
        dir-default /var/www           read,execute
        transition /var/www/cgi-bin    cgi
        bind tcp/80
    }
    pea cgi {
        include "stdlibs"
        include "perl"
        dir-default /var/www/data    allow
        path /tmp/mysql.sock         allow
    }
    pea mysql {
        include "stdlibs"
        path /usr/sbin/mysqld           read,execute
        path /tmp/mysql.sock            allow
        dir-default /usr/share/mysql    read
        dir-default /var/lib/mysql      allow
    }
}

Figure 5.11 – Web Delivery Rules

As Apache is the only program exposed to the outside world, it is the process that is most likely to be directly exploited. However, if an attacker is able to exploit it, the attacker is limited to a pea that is able only to read or write specific Apache files, as well as execute specific CGI programs into a separate pea. As the only way to access the database is through the CGI programs, the only access to the database an attacker would have is what is allowed by those programs. By writing the CGI programs carefully to sanitize the inputs passed to them, one can protect these entry points. Consequently, the ability of an attacker to cause serious harm to such a web content delivery system running with PeaPod's virtualization layer is significantly reduced.

5.4.3 Desktop Computing

For desktop computing, PeaPod's virtualization layer enables desktop computing environments to run multiple desktops from different security domains within multiple pods. Peas can also be used within the context of such a desktop computing environment to provide additional isolation. Many applications used on a daily basis, such as mp3 players [64] and web browsers [67], have had bugs that turn into security holes when maliciously created files are viewed by them. These holes allow attackers to execute malicious code or gain access to the entire local file system. Figure 5.12 describes a set of PeaPod rules that can contain a small set of desktop applications being used by a user with the /home/spotter home directory.

pod desktop {
    pea firefox {
        include "firefox"
        dir-default /home/spotter/.mozilla     allow
        dir-default /home/spotter/tmp          allow
        dir-default /home/spotter/download     allow
        transition /usr/bin/mpg123      mpg123
        transition /usr/bin/acroread    acroread
    }
    pea mp3 {
        include "stdlibs"
        path /usr/bin/mpg123               read,execute
        path /dev/dsp                      write
        dir-default /home/spotter/tmp      allow
        dir-default /home/spotter/music    allow
    }
    pea acroread {
        include "stdlibs"
        include "acroread"
        dir-default /home/spotter/tmp    allow
    }
}

Figure 5.12 – Desktop Application Rules

To secure an mp3 player, a pea can be created within the desktop computing pod that restricts the mp3 player's use of files outside of a special mp3 directory. As most users store their music within its own subtree, this is not a serious restriction. Most mp3 content should not be trusted, especially if one is streaming mp3s from a remote site. By running the mp3 player within this fully restricted pea, a malicious mp3 cannot compromise the user's desktop session. This mp3 player pea is simply configured with four file system permissions. First, a path-specific permission that provides access to the mp3 player itself is required to load the application. Second, a directory-default permission that provides access to the entire mp3 directory subtree is required to give the process access to the mp3 file library.
Third, a directory-default permission to a directory meant to store temporary files is required so the mp3 player can be used as a helper application. Finally, a path-specific permission that provides access to the /dev/dsp audio device is required to allow the process to play audio.
To secure a web browser, a pea can be created within a desktop computing pod that restricts the web browser's access to system resources. Consider the Mozilla Firefox web browser as an example. A Firefox pea would need to have all the files Firefox needs to run accessible from within the pea. Mozilla dynamically loads libraries and stores them along with its plugins within the /usr/lib/firefox directory. By providing a directory-default permission that provides access to that directory, as well as another directory-default permission that provides access to the user's .mozilla directory, the Firefox web browser can run normally within this special Firefox pea. Users also want the ability to download and save files, as well as launch viewers, such as for postscript or mp3 files, directly from the web browser. This involves a simple reconfiguration of Firefox to change its internal application.tmp dir variable to be a directory that is writable within the Mozilla pea. By creating such a directory, such as download within the user's home directory, and providing a directory-default permission allowing access, we allow one to explicitly save files, as well as implicitly save them when one wants to execute a helper application.
Similarly, just as Mozilla is configured to run helper applications for certain file types, one could configure the Mozilla pea to execute those helper applications within their respective peas. As shown in Figure 5.12, for an mp3 player, configuring such a pea for these processes is fairly simple. The only addition one would have to make is to provide an additional pea transition permission to the Mozilla pea that tells PeaPod's virtualization layer to transition the process to a separate pea on execution of programs such as the mpg123 mp3 player or the Acrobat Reader PDF viewer. However, this desktop computing example is also the most complicated, and shows the difficulty that can occur in trying to secure a complex desktop. In this example we only attempt to secure a simplified desktop and isolate three applications, and yet it is the largest rule set. Many desktop environments are made up of many applications, and each application would need its own set of rules. To avoid the need to create rules for each individual application, we created Apiary, described in Chapter 7, to specifically address desktop security.

5.5 Experimental Results

We implemented PeaPod's virtualization layer as a loadable kernel module in Linux that requires no changes to the Linux kernel source code or design. We present experimental results using our Linux prototype to quantify the overhead of using PeaPod on various applications. Experiments were conducted on two IBM Netfinity 4500R machines, each with a 933Mhz Intel Pentium-III CPU, 512MB RAM, 9.1 GB SCSI HD and a 100 Mbps Ethernet connected to a 3Com Superstack II 3900 switch. One of the machines was used as an NFS server from which directories were mounted
to construct the virtual file system for the PeaPod on the other client system. The client ran Debian stable with a 2.4.21 kernel.

To measure the cost of PeaPod's virtualization layer, we used a range of microbenchmarks and real application workloads and measured their performance on our Linux PeaPod prototype and a vanilla Linux system. Table 5.1 shows the five microbenchmarks and four application benchmarks we used to quantify PeaPod's virtualization overhead. To obtain accurate measurements, we rebooted the system between measurements. The files for the benchmarks were stored on the NFS server. All of these benchmarks were performed in a chrooted environment on the NFS client machine running Debian Unstable.

Name        Description
getpid      average getpid runtime
ioctl       average runtime for the FIONREAD ioctl
semaphore   IPC semaphore variable is created and removed
fork-exit   process forks and waits for child that calls exit immediately
fork-sh     process forks and waits for child to run /bin/sh to run a program that prints "hello world" then exits
Postmark    use the Postmark benchmark to simulate Sendmail performance
Apache      runs Apache 1.3 under load and measures average request time
Make        Linux kernel 2.4.21 compile with up to 10 processes active at one time
MySQL       "TPC-W like" interactions benchmark that uses Tomcat 4 and MySQL 4

Table 5.1 – Application Benchmarks

Figure 5.13 shows the results of running the benchmarks under both configurations, with the vanilla Linux configuration normalized to one. Since all benchmarks measure the time to run the benchmark, a smaller number is better for all benchmark results. The results in Figure 5.13 show that PeaPod's pea virtualization layer, as expected, imposes negligible overhead over the already existing pod virtualization.

[Figure 5.13 – PeaPod Virtualization Overhead: normalized performance of plain Linux and PeaPod for each benchmark]

This is because, to enforce resource isolation, all PeaPod has to do is compare the resource's pea attribute against the process trying to access it. For PIDs and IPC keys, it is a single equality test, which is minimal extra work beyond looking up the virtualized mapping in a hash table. On the other hand, for file system entries it might have to iterate through a small set of rules. Furthermore, this only matters for file system operations that care about permissions, such as open; for all other file system operations, such as read and write, there is no extra pea overhead. Therefore, just like *Pod, PeaPod incurs less than 10% overhead for most of the micro-benchmarks and less than 4% overhead for the application workloads.

For the system call microbenchmarks, PeaPod has to do very little extra work to restrict each process to its pea context. Similarly, PeaPod has to do very little work to virtualize the file system. This is apparent from both the Postmark benchmark and a set of real applications. Postmark was configured to operate on files between 512 and 10K bytes in size, representative of the individual files on a mail server queue, with an initial set of 20,000 files, and to perform 200,000 transactions. PeaPod exhibits very little overhead in the Postmark benchmark as it does not require any additional I/O operations.
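To make this cost model concrete, the following is a minimal user-space sketch, not the in-kernel implementation, of the kind of rule lookup the pea layer performs on permission-checking operations such as open. The rule kinds mirror the path and dir-default permissions of Figures 5.11 and 5.12; the class and function names are hypothetical.

from dataclasses import dataclass

@dataclass
class Rule:
    kind: str        # "path" (exact match) or "dir-default" (subtree match)
    prefix: str      # file or directory the rule applies to
    perms: set       # e.g. {"read", "execute"} or {"allow"}

class Pea:
    """A small set of file system rules attached to a group of processes."""
    def __init__(self, name, rules):
        self.name = name
        self.rules = rules

    def check(self, path, wanted):
        # Iterate through the (small) rule list; an exact path rule or an
        # enclosing dir-default rule must grant the requested permission.
        for rule in self.rules:
            if rule.kind == "path" and path == rule.prefix:
                return wanted in rule.perms or "allow" in rule.perms
            if rule.kind == "dir-default" and path.startswith(rule.prefix.rstrip("/") + "/"):
                return wanted in rule.perms or "allow" in rule.perms
        return False  # default deny: no rule grants access

# Example: the mp3 pea from Figure 5.12
mp3 = Pea("mp3", [
    Rule("path", "/usr/bin/mpg123", {"read", "execute"}),
    Rule("path", "/dev/dsp", {"write"}),
    Rule("dir-default", "/home/spotter/tmp", {"allow"}),
    Rule("dir-default", "/home/spotter/music", {"allow"}),
])

assert mp3.check("/home/spotter/music/song.mp3", "read")
assert not mp3.check("/etc/passwd", "read")

As the sketch suggests, the per-operation work is a short linear scan over a handful of rules, which is consistent with the small measured overhead.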
In addition, PeaPod exhibited a maximum overhead of 4% in real application benchmarks. This overhead was measured using the http load benchmark [108] to place a parallel fetch load on an Apache 1.3 server by simulating 30 parallel clients continuously fetching a set of files and measuring the average request time for each HTTP session. Similarly, we tested MySQL as part of a web commerce scenario outlined by TPC-W with a bookstore servlet running on top of Tomcat 4 with a MySQL 4 back-end. The PeaPod overhead for this scenario was less than 2% versus vanilla Linux. 5.6 Related Work A number of other approaches have explored the idea of virtualizing the operating system environment to provide application isolation. FreeBSD’s Jail mode [74] provides a chroot-like environment that processes cannot break out of. However, Jail is limited, such as the fact that it does not allow IPC within a jail [57], and therefore many real-world application will not work. More recently, Linux Vserver [7] and Containers [6], and Solaris Zones [116] offer a VM abstraction similar to PeaPod’s pods, but require substantial in-kernel modifications to support the abstraction. Although these systems share the simplicity of the Pod abstraction, they do not provide finer-grained isolation as provided with peas. Similarly, VMMs have been used to provide secure isolation [28, 142, 147]. Unlike PeaPod’s virtualization layer, VMMs decouple processes from the underlying machine hardware, but tie them to an instance of an operating system. As a result, VMMs provide an entire operating system instance and namespace for each VM and lack the ability to isolate components within an operating system. If a single process in a VM is exploitable, malicious code can use it to access the entire set of operating system Chapter 5. PeaPod: Isolating Cooperating Processes 93 resources. As PeaPod’s virtualization layer decouples processes from the underlying operating system and its resulting namespace, it is natively able to limit the separate processes of a larger system to the appropriate resources needed by them. Furthermore, VMMs require more administrative overhead due to requiring administration of multiple full OS instances, as well as imposing higher memory overhead due to the requirements of the underlying operating system. Many systems have been developed to isolate untrusted applications. NSA’s Security Enhanced Linux [86], which is based upon the Flask Architecture [133], implements a policy language that is used to implement models that enforce privilege separation. The policy language is very flexible but also very complex to use. The example security policy is over 80 pages long. There is research into creating tools to make policy analysis tractable [21], but the language’s complexity makes it difficult for the average end user to construct an appropriate policy. System call interception has been used by systems such as Janus [59, 144], Systrace [118], MAPbox [17], Software Wrappers [80] and Ostia [60]. These systems can enable flexible access controls per system call, but they have been limited by the difficulty of creating appropriate policy configurations. TRON [30], SubDomain [49] and Alcatraz [84] also operate at the system call level but focus on limiting access to the underlying file system. TRON allows transitions between different isolation units but requires application modifications to use this feature, while SubDomain supports an implicit transition on execution of a new child process. 
These systems provide a model somewhat similar to the file system approach used by PeaPod peas. However, the pea’s file system virtualization is based on a full-fledged stackable file system that integrates fully with regular kernel security infrastructure and provides much better performance. Similarly, the PeaPod’s kernel virtualization layer provides a complete process-isolation solution that is not just limited to file system protection. Chapter 5. PeaPod: Isolating Cooperating Processes 94 Safer languages and runtime environments, most notably Java, have been developed to prevent common software errors and isolate applications in language-based virtual machine environments. These solutions require applications to be rewritten or recompiled, often with some loss in performance. Other language-based tools [25, 48] have also been developed to harden applications against common attacks, such as buffer overflow attacks. PeaPod’s virtualization layer complements these approaches by providing isolation of legacy applications without modification. Chapter 6 Strata: Managing Large Numbers of Machines A key problem organizations face is how to efficiently provision and maintain the large number of machines deployed throughout their organizations. This problem is exemplified by the growing adoption and use of virtual appliances (VAs). VAs are pre-built software bundles run inside virtual machines (VMs). For example, one VM might be tailored to be a web server VA, while another might be tailored to be a desktop computing VA. Since VAs are often tailored to a specific application, these configurations can be smaller and simpler, potentially resulting in reduced resource requirements and more secure deployments. VAs simplify application deployment. Once an application is installed in a VA, it is easily deployed by end users with minimal hassle because both the software and its configuration have already been set up in the VA. A new VA can be easily created by cloning an existing VA that already contains a base installation of the necessary software, then modifying it by adding applications and changing the system configuration. There is no need to set up the common components from scratch. Chapter 6. Strata: Managing Large Numbers of Machines 96 But while virtualization and VAs decrease the cost of hardware, they can tremendously increase the human cost of administering these machines. As VAs are cloned and modified, creating an ever-increasing sprawl of different configurations, organizations that once had a few hardware machines to manage now find themselves juggling many more VAs with diverse system configurations and software installations. For instance, in the past, an organization would have run services such as web, mail, databases, file storage and shell access on a single machine because these services share many common files. By dividing these services into separate VAs, instead of a single machine with five services, one now has five independent VAs. This causes many management problems. First, because these VAs share a lot of common data, they are inefficient to store, as there are multiple copies of many common files. Although storage is cheap, the bandwidth available to write data to the disks is not. Copying the VA into place is extremely time-consuming and negatively impacts the performance of the other VAs running on the system. Second, by increasing the number of systems in use, we increase the number of systems needing security updates. 
Although software patches are released for security threats, constantly deploying patches and upgrading software creates a management nightmare as the number of VAs in the enterprise continues to rise. Many VAs may be turned off, suspended, or not even actively managed, making patch deployment before a security problem hits even more difficult. This problem is exacerbated by the large number of VAs in a large organization. Although the management of any one VA may not be difficult, the need to manage many different VAs results in a huge scaling problem for large organizations as management overhead grows linearly with the number of VAs needing maintenance. Instead of a single update for a machine running five services, the administrator now must apply the update five separate times. Chapter 6. Strata: Managing Large Numbers of Machines 97 Finally, as VAs are increasingly networked, the management problem only grows, given the myriad viruses and other attacks commonplace today. Security problems can wreak havoc on an organization’s virtual computing infrastructure. While virtualization can improve security via isolation, the sprawl of machines increases the number of hiding places for an attacker. Instead of a single actively used machine to monitor for malicious changes, administrators now have to monitor many less used machines. Furthermore, as VAs are designed to be dropped in place and not actively managed, administrators might not even know what VAs have been put into use by their end users. Many approaches have been used to address these problems, including traditional package management systems [4, 56], copy-on-write disks [91] and new VM storage formats [41, 103]. Unfortunately, these approaches suffer from various drawbacks that limit their utility and effectiveness in practice. They either incur management overheads that grow with the number of VAs, or require all VAs to have the same configuration, eliminating the main advantages of VAs. The fundamental problem with previous approaches is that they are based on a monolithic file system or block device. These file systems and block devices address their data at the block layer and are simply used as a storage entity. They have no direct concept of what the file system contains or how it is modified. However, managing VAs is essentially done by making changes to the file system. As a result, any upgrade or maintenance operation needs to be done to each VA independently, even when they all need the same maintenance. We have built Strata, a novel system that integrates file system unioning with package management semantics, by introducing the Virtual Layered File System (VLFS) and using it to solve VA management problems. Strata makes VA creation and provisioning fast. It simplifies the regular maintenance and upgrades that must Chapter 6. Strata: Managing Large Numbers of Machines 98 be performed on provisioned VA instances. Finally, it improves the administrator’s ability to detect and recover from security exploits. Strata achieves this by providing three architectural components: layers, layer repositories and Virtual Layered File Systems. A layer is a file hierarchy of related files that are typically installed and upgraded as a unit. Layers are analogous to software packages in package management systems. Like software packages, a layer may require other layers to function correctly, just as applications often require various system libraries to run. 
Strata associates dependency information with each layer that defines relationships among layers. Unlike software packages, which must be installed into each VA’s file system, layers can be shared directly among multiple VAs. Layer repositories are used to store layers centrally within a virtualization infrastructure, enabling them to be shared among multiple VAs. Layers are updated and maintained in the layer repository. For example, if a new version of an application becomes available, a new layer is added to the repository. If a patch for an application is issued, the corresponding layer is patched by creating a new layer with the patch. Different versions of the same application may be available through different layers in the layer repository. The layer repository is typically stored in a shared storage infrastructure accessible by the VAs, such as a Storage Area Network (SAN). Storing layers on the SAN does not impact VA performance because a SAN is where a traditional VA’s monolithic file system is stored. The VLFS is the file system for a VA. Unlike a traditional monolithic file system, it is a collection of individual layers dynamically composed into a single view. This is analogous to a traditional file system managed by a package manager that is composed of many packages extracted into it. Each VA has its own VLFS, which typically consists of a private read-write layer and a set of read-only layers shared through the layer repository. The private read-write layer is used for all file system Chapter 6. Strata: Managing Large Numbers of Machines 99 modifications private to the VA that occur during runtime, such as modifying user data. The shared read-only layers allow VAs with very different system configurations and applications to share common layers representing software components common across VAs. Layer changes to shared layers only need be done once in the repository and are then automatically propagated to all VLFSs, resulting in management overhead independent of the number of VAs. By dynamically building a VLFS out of discrete layers, Strata introduces file system unioning as the package management semantic. This provide a number of management benefits. First, Strata is able to create and provision VAs more quickly and easily. To create a template VA, an administrator just selects the applications and tools of interest from the layer repository. The template VA’s VLFS automatically unions the selected layers together with a read-write layer and incorporates any additional layers needed to resolve any necessary dependencies. This template VA then becomes the single image end users in an enterprise will use when they want to use this service. End users can instantly create provisioned instances of this template VA because no copying or on-demand paging is needed to instantiate its file system, as all the layers are accessed from the shared layer repository. The layer repository allows easy identification of the applications and tools of interest, and the VLFS automatically resolves dependencies on other layers, so provisioning VAs is relatively easy. Because VAs are just defined by their associated sets of layers, Strata also offers a new way to build VAs simply by combining existing ones. Second, Strata simplifies upgrades and maintenance of provisioned VAs. If a layer contains a bug to be fixed, the administrator creates a replacement layer with the fix and updates the template VA. 
This informs the provisioned VAs to incorporate the layer into their VLFS’s namespace view. Traditional VAs, which are provisioned and updated by replacing their file system [41, 103], have to be rebooted in order to Chapter 6. Strata: Managing Large Numbers of Machines 100 incorporate changes by making use of a new block device. Strata, however, allows online upgrades like a traditional package management system. Unlike package management system upgrades, in which a significant amount of time is spent deleting the existing files and copying the new files into place, upgrades in a VLFS are atomic, preventing the file system from ever being in an inconsistent state. Finally, this semantic allows VAs managed by Strata to easily recover from security exploits. VLFSs distinguish between files installed via its package manager, which are stored in a shared read-only layer, and the changes made over time, which are stored in the private read-write layer. If a VA is compromised and an attacker installs new malware or modifies an existing application, these changes will be separated from the deployed system’s initial state and isolated to the read-write layer. Such changes are easier to identify and remove, returning the VA to a clean state. 6.1 Strata Basics Figure 6.1 shows Strata’s three architectural components: layers, layer repositories and VLFSs. A layer is a distinct self-contained set of files that corresponds to a specific functionality. Strata classifies layers into three categories: software layers with selfcontained applications and system libraries, configuration layers with configuration file changes for a specific VA, and private layers allowing each provisioned VA to be independent. Layers can be mixed and matched, and may depend on other layers. For example, a single application or system library is not fully independent, but depends on the presence of other layers, such as those that provide needed shared libraries. Strata enables layers to enumerate their dependencies on other layers. This dependency scheme allows automatic provisioning of a complete, fully consistent file system by selecting the main features desired within the file system. Chapter 6. Strata: Managing Large Numbers of Machines 101 Figure 6.1 – How Layers, Repositories, and VLFSs Fit Together Layers are provided through layer repositories. As Figure 6.1 shows, a layer repository is a file system share containing a set of layers made available to VAs. When an update is available, the old layer is not overwritten. Instead, a new version of the layer is created and placed within the repository, making it available to Strata’s users. Administrators can also remove layers from the repository, e.g., those with known security holes, to prevent them from being used. Layer repositories are generally stored on centrally managed file systems, such as a SAN or NFS, but they can also be provided by protocols such as FTP and HTTP and mirrored locally. Layers from multiple layer repositories can form a VLFS as long as they are compatible with one another. This allows layers to be provided in a distributed manner. Layers provided by different maintainers can have the same layer names, causing a conflict. This, however, is no different from traditional package management systems as packages Chapter 6. Strata: Managing Large Numbers of Machines 102 with the same package name, but different functionality, can be provided by different package repositories. 
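As a rough illustration of how a repository can be enumerated, the sketch below scans repository directories for layer units and builds an index of the versions available for each layer name, flagging the cross-repository name collisions just described. It assumes each layer unit is stored as a directory named by layer name and version, separated by an underscore, in the spirit of the on-disk naming described later in Section 6.3.1; the paths and function name are hypothetical.

import os
from collections import defaultdict

def index_repositories(repo_dirs):
    """Map layer name -> {version: (repository, path)} across repositories."""
    index = defaultdict(dict)
    for repo in repo_dirs:
        for entry in sorted(os.listdir(repo)):
            path = os.path.join(repo, entry)
            if not os.path.isdir(path):
                continue
            # Assumed "name_version" directory naming, e.g. "mysql-server_5.0.51a-3".
            name, _, version = entry.rpartition("_")
            if not name:
                continue
            if version in index[name]:
                # The same layer unit offered by two repositories; like package
                # repositories, the administrator must decide which one to use.
                raise ValueError("duplicate layer unit %s %s" % (name, version))
            index[name][version] = (repo, path)
    return index

# Example (hypothetical paths): one local SAN repository and one local mirror.
# layers = index_repositories(["/srv/strata/main", "/srv/strata/mirror"])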
As Figure 6.1 shows, a VLFS is a collection of layers from layer repositories that are composed into a single file system namespace. The layers making up a particular VLFS are defined by the VLFS’s layer definition file, which enumerates all the layers that will be composed into a single VLFS instance. To provision a VLFS, an administrator selects software layers that provide the desired functionality and lists them in the VLFS’s layer definition file. Within a VLFS, layers are stacked on top of another and composed into a single file system view. An implication of this composition mechanism is that layers on top can obscure files on layers below them, only allowing the contents of the file instance contained within the higher level to be used. This means that files in the private or configuration layers can obscure files in lower layers, such as when one makes a change to a default version of a configuration file located within a software layer. However, to prevent an ambiguous situation from occurring, where the file system’s contents depend on the order of the software layers, Strata prevents software layers that contain the same file names from being composed into a single VLFS. 6.2 Strata Usage Model Strata’s usage model is centered around the usage of layers to quickly create VLFSs for VAs as shown in Figure 6.1. Strata allows an administrator to compose together layers to form template VAs. These template VAs can be used to form other template appliances that extend their functionality, as well as to provide the VA that end users will provision and use. Strata is designed to be used within the same setup as a traditional VM architecture. This architecture includes a cluster of physical machines Chapter 6. Strata: Managing Large Numbers of Machines 103 that are used to host VM execution as well as a shared SAN that stores all the VM disk images that can be executed. However, instead of storing disk images on the SAN, Strata stores the layers that will be used by the VMs it manages. 6.2.1 Creating Layers and Repositories Layers are first created and stored in layer repositories. Layer creation is similar to the creation of packages in a traditional package management system, where one builds the software, installs it into a private directory, and turns that directory into a package archive, or in Strata’s case, a layer. For instance, to create a layer that contains the MySQL SQL server, the layer maintainer would download the source archive for MySQL, extract it, and build it normally. However, instead of installing it into the system’s root directory, one installs it into a virtual root directory that becomes the file system component of this new layer. The layer maintainer then defines the layer’s metadata, including its name (mysql-server in this case) and an appropriate version number to uniquely identify this layer. Finally, the entire directory structure of the layer is copied into a layer repository, making the layer available to users of that repository. 6.2.2 Creating Appliance Templates Given a layer repository, an administrator can then create template VAs. Creating a template VA involves: 1. Creating the template VA with an identifiable name and the VLFS it will use. 2. Determining what repositories are available to it. 3. Selecting a set of layers that provide the functionality desired. Chapter 6. 
Strata: Managing Large Numbers of Machines 104 For example, to create a template VA that provides a MySQL SQL server, an administrator creates an appliance/VLFS named sql-server and selects the layers needed for a fully functional MySQL server file system, most importantly, the mysql-server layer. Strata composes these layers together into the VLFS in a readonly manner along with a read-write private layer, making the VLFS usable within a VM. The administrator boots the VM and makes the appropriate configuration changes to the template VA, storing them within the VLFS’s private layer. Finally, the private layer belonging to the template appliance’s VLFS is frozen and becomes the template’s configuration layer. As another example, to create an Apache web server appliance, an administrator creates an appliance/VLFS named web-server, and selects the layers required for an Apache web server, most importantly, the layer containing the Apache httpd program. Strata extends this template model by allowing multiple template VAs to be composed together into a single new template. For example, an administrator can create a new template VA/VLFS, sql+web-server, composed of the MySQL and Apache template VAs. The resulting VLFS has the combined set of software layers from both templates, both of their configuration layers, and a new configuration layer containing the configuration state that integrates the two services together, for a total of three configuration layers. 6.2.3 Provisioning and Running Appliance Instances Given templates, VAs are efficiently and quickly provisioned and deployed by end users by cloning the available templates. Provisioning a VA involves: 1. Creating a virtual machine container with a network adapter and an virtual disk. Chapter 6. Strata: Managing Large Numbers of Machines 105 2. Using the network adapter’s MAC address as the machine’s identifier for identifying the VLFS created for this machine. 3. Forming the VLFS by referencing the already existing template VLFS and combining the template’s read-only software and configuration layers with a readwrite private layer provided by the VM’s virtual disk. As each VM managed by Strata does not have a physical disk off which to boot, Strata network boots each VM. When the VM boots, its BIOS discovers a network boot server which provides it with a boot image, including a base Strata environment. The VM boots this base environment, which then determines which VLFS should be mounted for the provisioned VM using the MAC address of the machine. Once the proper VLFS is mounted, the machine transitions to using it as its root file system. 6.2.4 Updating Appliances Strata upgrades provisioned VAs efficiently using a simple three-step process. First, an updated layer is installed into a shared layer repository. Second, administrators are able to modify the template appliances under their control to incorporate the update. Finally, all provisioned VAs based on that template will automatically incorporate the update as well. Note that updating appliances is much simpler than updating generic machines, as appliances are not independently managed machines. This means that extra software that can conflict with an upgrade will not be installed into a centrally managed appliance. Centrally managed appliance updates are limited to changes to their configuration files and what data files they store. Strata’s updates propagate automatically even if the VA is not currently running. 
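A toy model of this propagation, using hypothetical names and ignoring dependency resolution and the on-disk formats described later, might look like the following:

class Template:
    """A template appliance: an ordered mapping of shared layer -> version."""
    def __init__(self, layers):
        self.layers = dict(layers)

    def update(self, layer, new_version):
        # Step 2: the administrator points the template at the updated layer.
        self.layers[layer] = new_version

class ProvisionedVA:
    """A provisioned appliance: a reference to a template plus a private layer."""
    def __init__(self, template, private_layer):
        self.template = template
        self.private_layer = private_layer

    def compose(self):
        # Step 3: the VLFS is composed afresh from the template's current
        # layer list, so the update is picked up with no per-VA work.
        return list(self.template.layers.items()) + [self.private_layer]

sql = Template([("mysql-server", "5.0.51a-3")])
vas = [ProvisionedVA(sql, ("private-%d" % i, "rw")) for i in range(3)]
sql.update("mysql-server", "5.0.51a-4")   # hypothetical fixed layer version
assert all(("mysql-server", "5.0.51a-4") in va.compose() for va in vas)

Because each provisioned VLFS is recomposed from the template rather than copied from it, the administrator's single change reaches every appliance that references the template.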
If a VA is shut down, the VA will compose whatever updates have been applied to its templates automatically, never leaving the file system in a vulnerable state, because Chapter 6. Strata: Managing Large Numbers of Machines 106 it composes its file system afresh each time it boots. If it is suspended, Strata delays the update to when the VA is resumed, as updating layers is a quick task. Updating is significantly quicker than resuming, so this does not add much to its cost. Furthermore, VAs are upgraded atomically, as Strata adds and removes all the changed layers in a single operation. This is not like a traditional package management system which, when upgrading a package, first uninstalls it before reinstalling the newer version. The traditional method leaves the file system in an inconsistent state for a short period of time because it is possible that files needed for program execution may not be available. For instance, when the libc package is upgraded, its contents are first removed from the file system before being replaced. Any application that tries to execute during the interim will fail to dynamically link because the main library on which it depends is not present within the file system at that moment. 6.2.5 Improving Security Strata makes it much easier to manage VAs that have had their security compromised. By dividing a file system into a set of shared read-only layers and storing all file system modifications inside the private read-write layer, Strata separates changes made to the file system via layer management from regular runtime modifications. This enables Strata to easily determine when system files have been compromised as the changes will be readily visible in the private layer. This allows Strata to not rely on tools like Tripwire [79] or maintain separate databases to determine if files have been modified from their installed state. Similarly, this check can be run external to the VA, as it just needs access to the private layer, thereby preventing an attacker from disabling it. This reduces management load due to not requiring any external databases be kept in sync with the file system state as it changes. Chapter 6. Strata: Managing Large Numbers of Machines 107 This segregation of modified file system state also enables quick recovery from a compromised system. By simply replacing the VA’s private layer with a fresh private layer, the compromised system is immediately fixed by returning it to its default freshly provisioned state. However, unlike reinstalling a system from scratch, replacing the private layer does not require throwing away the contents of the old private layer. Strata enables the layer to be mounted within the file system, enabling administrators to have easy access to the files located within it to move the uncompromised files back to their proper places. 6.3 Virtual Layered File System Strata introduces the concept of a virtual layered file system in place of traditional monolithic file systems. Strata’s VLFS allows file systems to be created by composing layers together into a single file system namespace view. Strata allows these layers to be shared by multiple VLFSs in a read-only manner or to remain read-write and private to a single VLFS. Every VLFS is defined by a layer definition file (LDF), which specifies what software layers should be composed together. An LDF is a simple text file that lists the layers and their respective repositories. 
The LDF’s layer list syntax is repository/layer version and can be preceded by an optional modifier command. When an administrator wants to add or remove software from the file system, instead of modifying the file system directly, they modify the LDF by adding or removing the appropriate layers. Figure 6.2 contains an example LDF for a MySQL SQL server template appliance. The LDF lists each individual layer included in the VLFS along with its corresponding repository. Each layer also has a number indicating which version will be composed Chapter 6. Strata: Managing Large Numbers of Machines 108 into the file system. If an updated layer is made available, the LDF is updated to include the new layer version instead of the old one. If the administrator of the VLFS does not want to update the layer, they can hold a layer at a specific version, with the = syntax element. This is demonstrated by the mailx layer in Figure 6.2, which is being held at the version listed in the LDF. Strata allows an administrator to explicitly select only the few layers corresponding to the exact functionality desired within the file system. Other layers needed in the file system are implicitly selected by the layers’ dependencies as described in Section 6.3.2. Figure 6.2 shows how Strata distinguishes between explicitly and implicitly selected layers. Explicitly selected layers are listed first and separated from the implicitly selected layers by a blank line. In this case, the MySQL server has only one explicit layer, mysql-server, but has 21 implicitly selected layers. These include utilities such as Perl and TCP Wrappers (tcpd), as well as libraries such as OpenSSL (libssl). It also includes a layer providing a shared base common to all VLFSs. Strata distinguishes explicit layers from implicit layers to allow future reconfigurations to remove one implicit layer in favor of another if dependencies need to change. When an end user provisions an appliance by cloning a template, an LDF is created for the provisioned VA. Figure 6.3 shows an example introducing another syntax element, @, that instructs Strata to reference another VLFS’s LDF as the basis for this VLFS. This lets Strata clone the referenced VLFS by including its layers within the new VLFS. In this case, because the user wants only to deploy the SQL server template, this VLFS LDF only has to include the single @ line. In general, a VLFS can reference more than one VLFS template, assuming that layer dependencies allow all the layers to coexist. Chapter 6. Strata: Managing Large Numbers of Machines 109 main/mysql-server 5.0.51a-3 main/base 1 main/libdb4.2 4.2.52-18 main/apt-utils 0.5.28.6 main/liblocale-gettext-perl 1.01-17 main/libtext-charwidth-perl 0.04-1 main/libtext-iconv-perl 1.2-3 main/libtext-wrapi18n-perl 0.06-1 main/debconf 1.4.30.13 main/tcpd 7.6-8 main/libgdbm3 1.8.3-2 main/perl 5.8.4-8 main/psmisc 21.5-1 main/libssl0.9.7 0.9.7e-3 main/liblockfile1 1.06 main/adduser 3.63 main/libreadline4 4.3-11 main/libnet-daemon-perl 0.38-1 main/libplrpc-perl 0.2017-1 main/libdbi-perl 1.46-6 main/ssmtp 2.61-2 =main/mailx 3a8.1.2-0.20040524cvs-4 Figure 6.2 – Layer Definition for MySQL Server @main/sql-server Figure 6.3 – Layer Definition for Provisioned Appliance 6.3.1 Layers Strata’s layers are composed of three components: metadata files, the layer’s file system and configuration scripts. The metadata files define the information that describes the layer. This includes its name, version and dependency information. 
This information is important to ensure that a VLFS is composed correctly. The metadata file contains all the metadata that is specified for the layer. Figure 6.4 shows an example metadata file. Figure 6.5 shows the full metadata syntax. The metadata file has a single field per line with two elements, the field type and the field contents. In general, the metadata file’s syntax is Field Type: value, where value Chapter 6. Strata: Managing Large Numbers of Machines 110 can be either a single entry or a comma-separated list of values. The layer’s file system is a self-contained set of files providing a specific functionality. The files are the individual items in the layer that are composed into a larger VLFS. There are no restrictions on the types of files that can be included. They can be regular files, symbolic links, hard links or device nodes. Similarly, each directory entry can be given whatever permissions are appropriate. A layer can be seen as a directory stored on the shared file system that contains the same file and directory structure that would be created if the individual items were installed into a traditional file system. On a traditional UNIX system, the directory structure would typically contain directories such as /usr, /bin and /etc. Symbolic links work as expected between layers since they work on path names, but one limitation is that hard links cannot exist between layers. The layer’s configuration scripts are run when a layer is added or removed from a VLFS to allow proper integration of the layer within the VLFS. Although many layers are just a collection of files, other layers need to be integrated into the system as a whole. For example, a layer that provides MP3 file playing capability should register itself with the system’s MIME database to allow programs contained within the layer to be launched automatically when a user wants to play an MP3 file. Similarly, if the layer were removed, it should remove the programs contained within itself from the MIME database. Strata supports four types of configuration scripts: pre-remove, post-remove, preinstall and post-install. If they exist in a layer, the appropriate script is run before or after a layer is added or removed. For example, a pre-remove script can be used to shut down a daemon before it is actually removed, while a post-remove script can be used to clean up file system modifications in the private layer. Similarly, a pre-install script can ensure that the file system is as the layer expects, while the post-install Chapter 6. Strata: Managing Large Numbers of Machines 111 Layer: mysql-server Version: 5.0.51a-3 Depends: ..., perl (>= 5.6), tcpd (>= 7.6-4),... Figure 6.4 – Metadata for MySQL Server Layer Layer: Layer Name Version: Version of Layer Unit Conflicts: layer1 (opt. constraint), ... Depends: layer1 (...), layer2 (...) | layer3, ... Pre-Depends: layer1 (...), ... Provides: virtual_layer, ... Figure 6.5 – Metadata Specification script can start daemons included in the layer. The configuration scripts can be written in any scripting language. The layer must include the proper dependencies to ensure that the scripting infrastructure is composed into the file system in order to allow the scripts to run. Layers are stored on disk as a directory tree named by the layer’s name and version. For instance, version 5.0.51a of the MySQL server, with a Strata layer version of 3, would be stored under the directory mysql-server 5.0.51a-3. 
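A minimal reader for this "Field Type: value" format, assuming only the fields shown in Figure 6.5, treating comma-separated fields as lists, and leaving "|" alternatives within an element unsplit, could be sketched as follows (the function name is hypothetical):

LIST_FIELDS = {"Conflicts", "Depends", "Pre-Depends", "Provides"}

def parse_metadata(text):
    """Parse a layer metadata file of 'Field: value' lines (see Figure 6.5)."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        field, _, value = line.partition(":")
        field, value = field.strip(), value.strip()
        if field in LIST_FIELDS:
            meta[field] = [v.strip() for v in value.split(",") if v.strip()]
        else:
            meta[field] = value
    return meta

# Abbreviated example based on Figure 6.4 (only the entries shown there).
example = """Layer: mysql-server
Version: 5.0.51a-3
Depends: perl (>= 5.6), tcpd (>= 7.6-4)
"""
meta = parse_metadata(example)
assert meta["Layer"] == "mysql-server"
assert "perl (>= 5.6)" in meta["Depends"]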
Within this directory, Strata defines a metadata file, a filesystem directory and a scripts directory corresponding to the layer’s three components. 6.3.2 Dependencies A key Strata metadata element is enumeration of the dependencies that exist between layers. Strata’s dependency scheme is heavily influenced by the dependency scheme in Linux distributions such as Debian and Red Hat. In Strata, every layer composed into Strata’s VLFS is termed a layer unit. Every layer unit is defined by its name and version. Two layer units that have the same name but different layer versions are Chapter 6. Strata: Managing Large Numbers of Machines 112 different units of the same layer. A layer refers to the set of layer units of a particular name. Every layer unit in Strata has a set of dependency constraints placed within its metadata. There are four types of dependency constraints: • dependency • pre-dependency • conflict • provide Dependency and Pre-Dependency: Dependency and pre-dependency constraints are similar in that they require another layer unit to be integrated at the same time as the layer unit that specifies them. They differ only in the order the layer’s configuration scripts are executed to integrate them into the VLFS. A regular dependency does not dictate order of integration. A pre-dependency dictates that the dependency has to be integrated before the dependent layer. Figure 6.4 shows that the MySQL layer depends on TCP Wrappers (tcpd) because it dynamically links against the shared library libwrap.so.0 provided by TCP Wrappers. MySQL cannot run without this shared library, so the layer units that contain MySQL must depend on a layer unit containing an appropriate version of the shared library. These constraints can also be versioned to further restrict which layer units satisfy the constraint. For example, shared libraries can add functionality that breaks their application binary interface (ABI), breaking in turn any applications that depend on that ABI. Since MySQL is compiled against version 0.7.6 of the libwrap library, the dependency constraint is versioned to ensure that a compatible version of the library is integrated at the same time. Chapter 6. Strata: Managing Large Numbers of Machines 113 Conflict: Conflict constraints indicate that layer units cannot be integrated into the same VLFS. This generally occurs because the layer units depend on exclusive access to the same operating system resource. This can be a TCP port in the case of an Internet daemon, or two layer units that contain the same file pathnames and therefore would obscure each other. For this reason, two layer units of the same layer are by definition in conflict because they will contain some of the same files. An example of this constraint occurs when the ABI of a shared library changes without any source code changes, generally due to an ABI change in the tool chain that builds the shared library. Because the ABI has changed, the new version can no longer satisfy any of the previous dependencies. But because nothing else has changed, the file on disk will usually not be renamed either. A new layer must then be created with a different name, ensuring that the library with the new ABI is never used to satisfy an old dependency on the original layer. Because the new layer contains the same files as the old layer, it must conflict with the older layer to ensure that they are not integrated into the same file system. Provide: Provide dependency constraints introduce virtual layers. 
A regular layer provides a specific set of files, but a virtual layer indicates that a layer provides a particular piece of general functionality. Layer units that depend on a certain piece of general functionality can depend on a specific virtual layer name in the normal manner, while layer units that provide that functionality will explicitly specify that they do. For example, layer units that provide webmail or content management software depend on the presence of a web server, but which one is not important. Instead of depending on a particular web server, they depend on the virtual layer name httpd. Similarly, layer units containing a web server, such as Apache or Boa, are defined to provide the httpd virtual layer name and therefore satisfy those dependencies. Unlike regular layer units, virtual layers are not versioned.

6.3.2.1 Dependency Example

Figure 6.2 shows how dependencies can affect a VLFS in practice. This VLFS has only one explicit layer, mysql-server, but 21 implicitly selected layers. The mysql-server layer itself has a number of direct dependencies, including Perl, TCP Wrappers, and the mailx program. These dependencies in turn depend on the Berkeley DB library and the GNU dbm library, among others. Using its dependency mechanism, Strata is able to automatically resolve all the other layers needed to create a complete file system by specifying just a single layer.

Returning to Figure 6.4, this example defines a subset of the layers that the mysql-server layer requires to be composed into the same VLFS to allow MySQL to run correctly. More generally, Figure 6.5 shows the complete syntax for the dependency metadata. Provides is the simplest, with only a comma-separated list of virtual layer names. Conflicts adds an optional version constraint to each conflicted layer to limit the layer units that are actually in conflict. Depends and Pre-Depends add a boolean OR of multiple layers in their dependency constraints to allow multiple layers to satisfy the dependency.

6.3.2.2 Resolving Dependencies

To allow an administrator to select only the layers explicitly desired within the VLFS, Strata automatically resolves dependencies to determine which other layers must be included implicitly. To allow dependency resolution, Strata first provides a database of all the available layer units' locations and metadata. The collection of layer units can be viewed as three sets: the set of layer units themselves, the set of dependency relations for each individual layer unit, and the set of conflict relations (C) that define which layer units cannot be integrated into the same file system. This collection can be viewed as a directed dependency graph connecting layer units to the layer units on which they depend. A layer unit can be integrated into the VLFS when two principles hold. First, there must be a set of layer units (I) that fulfills total closure of all the dependencies, that is, every layer unit in the set has every dependency filled. Second, I × I ∩ C = ∅ must hold, meaning that none of the layer units in I can conflict with each other. Determining when these principles hold is a problem that has been shown to be polynomial time reducible to 3-SAT [47, 139]. Because 3-SAT is NP-complete, this could be very difficult to solve naively, but an optimized Davis-Putnam SAT solver [52] can be used to solve it efficiently [47].
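The two principles can be checked directly for a candidate set of layer units. The sketch below validates a proposed set I against dependency closure and the conflict set C; it is only a checker over unversioned names, not the SAT-based resolution referred to above, and the helper names and example units are hypothetical.

def satisfies(candidate_set, provides, depends, conflicts):
    """Check the two integration principles for a set I of layer units.

    provides: unit -> set of names it satisfies (its layer name plus virtual layers)
    depends:  unit -> list of dependency names the unit requires
    conflicts: set of frozenset({a, b}) pairs that may not coexist (the set C)
    """
    # Principle 1: total closure -- every dependency of every unit in I
    # is satisfied by some unit in I.
    available = set()
    for unit in candidate_set:
        available |= provides[unit]
    for unit in candidate_set:
        for need in depends[unit]:
            if need not in available:
                return False
    # Principle 2: (I x I) intersect C is empty -- no two units in I conflict.
    for a in candidate_set:
        for b in candidate_set:
            if a != b and frozenset((a, b)) in conflicts:
                return False
    return True

# Example with hypothetical units: mysql-server requires perl and tcpd.
provides = {"mysql-server": {"mysql-server"}, "perl": {"perl"}, "tcpd": {"tcpd"}}
depends = {"mysql-server": ["perl", "tcpd"], "perl": [], "tcpd": []}
assert satisfies({"mysql-server", "perl", "tcpd"}, provides, depends, set())
assert not satisfies({"mysql-server", "perl"}, provides, depends, set())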
Even when a layer unit can be integrated into the VLFS, however, there will often be many sets of implicitly selected layer units that allow this. Strata therefore has to evaluate which of those sets is the best. Linux distributions already face this problem and tools have been developed to address it, such as Apt [36] and Smart [98]. Strata leverages Smart and adopts the same metadata database format that Debian uses for packages for its own layers, as Smart already knows how to parse it. When Smart is used with a regular Linux distribution, administrators request that it install or remove packages and Smart determines whether the operation can succeed and what is the best set of packages to add or remove to achieve that goal. In Strata, when an administrator requests that a layer be added to or removed from a template appliance, Smart also evaluates if the operation can succeed and what is the best set of layers to add or remove. Instead of acting directly on the contents of the file system, however, Strata only has to update the template’s LDF with the set of layers to be composed into the file system. Chapter 6. Strata: Managing Large Numbers of Machines 6.3.3 116 Layer Creation Strata allows layers to be created in two ways. First, .deb packages used by Debianderived distributions and the .rpm packages used by RedHat-derived distributions can be directly converted into layers. Strata converts packages into layers in two steps. First, the relevant metadata from the package is extracted, including its name and version. Second, the package’s file contents are extracted into a private directory that will be the layer’s file system components. When using converted packages, Strata leverages the underlying distribution’s tools to run the configuration scripts belonging to the newly created layers correctly. Instead of using the distribution’s tools to unpack the software package, Strata composes the layers together and uses the distribution’s tools as though the packages have already been unpacked. Although Strata is able to convert packages from different Linux distributions, it cannot mix and match them because they are generally ABI incompatible with one another. More commonly, Strata leverages existing packaging methodologies to simplify the creation of layers from scratch. In a traditional system, when administrators install a set of files, they copy the files into the correct places in the file system using the root of the file system tree as their starting point. For instance, an administrator might run make install to install a piece of software compiled on the local machine. In Strata, layer creation is a three-step process. First, instead of copying the files into the root of the local file system, the layer creator installs the files into their own specific directory tree. That is, they make a blank directory to hold a new file system tree that is created by having the make install copy the files into a tree rooted at that directory, instead of the actual file system root. Second, the layer maintainer extracts programs that integrate the files into the underlying file system and creates scripts that run when the layer is added to and Chapter 6. Strata: Managing Large Numbers of Machines 117 removed from the file system. Examples of this include integration with GNOME’s GConf configuration system, creation of encryption keys, or creation of new local users and groups for new services that are added. 
This leverages skills that package maintainers in a traditional package management world already have. Finally, the layer maintainer needs to set up the metadata correctly. Some elements of the metadata, such as the name of the layer and its version, are simple to set, but dependency information can be much harder. But because package management tools have already had to address this issue, Strata is able to leverage the tools they have built. For example, package management systems have created tools that infer dependencies using an executable dynamically linking against shared libraries [117]. Instead of requiring the layer maintainer to enumerate each shared library dependency, we can programmatically determine which shared libraries are required and populate the dependency fields based on those versions of the library currently installed on the system where the layer is being created. 6.3.4 Layer Repositories Strata provides local and remote layer repositories. Local layer repositories are provided by locally accessible file system shares made available by a SAN. They contain layer units to be composed into the VLFS. This is similar to a regular virtualization infrastructure in which all the virtual machines’ disks are stored on a shared SAN. Each layer unit is stored as its own directory; a local layer repository contains a set of directories, each of which corresponds to a layer unit. The local layer repository’s contents are enumerated in a database file providing a flat representation of the metadata of all the layer units present in the repository. The database file is used for making a list of what layers can be installed and their dependency information. By storing Chapter 6. Strata: Managing Large Numbers of Machines 118 the shared layer repository on the SAN, Strata lets layers be shared securely among different users’ appliances. Even if the machine hosting the VLFS is compromised, the read-only layers will stay secure, as the SAN will enforce the read-only semantic independently of the VLFS. Remote layer repositories are similar to local layer repositories, but are not accessible as file system shares. Instead, they are provided over the Internet, by protocols such as FTP and HTTP, and can be mirrored into a local layer repository. Instead of mirroring the entire remote repository, Strata allows on-demand mirroring, where all the layers provided by the remote repository are accessible to the VAs, but must be mirrored to the local mirror before they can be composed into a VLFS. This allows administrators to store only the needed layers while maintaining access to all the layers and updates that the repository provides. Administrators can also filter which layers should be available to prevent end users from using layers that violate administration policy. In general, an administrator will use these remote layer repositories to provide the majority of layers, much as administrators use a publicly managed package repository from a regular Linux distribution. Layer repositories let Strata operate within an enterprise environment by handling three distinct yet related issues. First, Strata has to ensure that not all end users have access to every layer available within the enterprise. For instance, administrators may want to restrict certain layers to certain end users for licensing, security or other corporate policy reasons. Second, as enterprises get larger, they gain levels of administration. 
Strata must support the creation of an enterprise-wide policy while also enabling small groups within the enterprise to provide more localized administration. Third, larger enterprises supporting multiple operating systems cannot rely on a single repository of layers because of inherent incompatibilities among operating systems. Chapter 6. Strata: Managing Large Numbers of Machines 119 By allowing a VLFS to use multiple repositories, Strata solves these three problems. First, multiple repositories let administrators compartmentalize layers according to the needs of their end users. By providing end users with access only to needed repositories, organizations prevent their end users from using the other layers. Second, by allowing sub-organizations to set up their own repositories, Strata lets a sub-organization’s administrator provide the layers that end users need without requiring intervention by administrators of global repositories. Finally, multiple repositories allow Strata to support multiple operating systems, as each distinct operating system has its own set of layer repositories. Strata supports multiple layer repositories by providing a directory of layer repositories that can contain multiple subdirectories, each of which serves as a mount point for a layer repository file system share, or as a location to store the layers themselves locally. This enables administrators to use regular file system share controls to determine which layer repositories users can access. 6.3.5 VLFS Composition To create a VLFS, Strata has to solve a number of file system-related problems. First, Strata has to support the ability to combine numerous distinct file system layers into a single static view. This is equivalent to installing software into a shared read-only file system. Second, because users expect to treat the VLFS as a normal file system, for instance, by creating and modifying files, Strata has to let VLFSs be fully modifiable. Similarly, users must also be able to delete files that exist on the read-only layer. By basing the VLFS on top of unioning file systems [102, 150], Strata solves all these problems. Unioning file systems join multiple layers into a single namespace. Unioning file systems have been extended to apply attributes such as read-only and Chapter 6. Strata: Managing Large Numbers of Machines 120 read-write to their layers. The VLFS leverages this property to force shared layers to be read-only, while the private layer remains read-write. If a file from a shared read-only layer is modified, it is copied-on-write (COW) to the private read-write layer before it is modified. For example, LiveCDs use this functionality to provide a modifiable file system on top of the read-only file system provided by the CD. Finally, unioning file systems use whiteouts to obscure files located on lower layers. For example, if a file located on a read-only layer is deleted, a whiteout file will be created on the private read-write layer. This file is interpreted specially by the file system and is not revealed to the user while also preventing the user from seeing files with the same name. However, Strata has to solve two additional problems. First, Strata must maintain the usage semantic that users can recover deleted system files by reinstalling or upgrading the layer that contains them. For example, in a traditional monolithic file system managed by a package management system, reinstalling a package will replace any files that might have been deleted. 
However, if the VLFS only used a traditional union file system, the whiteouts stored in the private layer would persist and continue to obscure the file even if the shared layer was replaced. To solve this problem, Strata provides a VLFS with additional writeable layers associated with each read-only shared layer. Instead of containing file data, as does the topmost private writeable layer, these layers just contain whiteout marks that will obscure files contained within their associated read-only layer. The user can delete a file located in a shared read-only layer, but the deletion only persists for the lifetime of that particular instance of the layer. When a layer is replaced during an upgrade or reinstall, a new empty whiteout layer will be associated with the replacement, thereby removing any preexisting whiteouts. In a similar way, Strata handles the case where a file belonging to a shared read-only layer is modified and therefore copied to the Chapter 6. Strata: Managing Large Numbers of Machines 121 VLFS’s private read-write layer. Strata provides a revert command that lets the owner of a file that has been modified revert the file to its original pristine state. While a regular VLFS unlink operation would have removed the modified file from the private layer and created a whiteout mark to obscure the original file, revert only removes the copy in the private layer, thereby revealing the original below it. Second, Strata supports adding and removing layers dynamically without taking the file system offline. This is equivalent to installing, removing or upgrading a software package while a monolithic file system is online. While some upgrades, specifically of the kernel, will require the VA to be rebooted, most should be able to occur without taking the VA offline. However, if a layer is removed from a union, its data is effectively removed as well because unions operate only on file system namespaces and not on the data the underlying files contain. If an administrator wants to remove a layer from the VLFS, they must take the VA offline, because layers cannot be removed while in use. To solve this problem, Strata emulates a traditional monolithic file system. When an administrator deletes a package containing files in use, the processes that are currently using those files will continue to work. This occurs by virtue of unlink’s semantic of first removing a file from the file system’s namespace, and only removing its data after the file is no longer in use. This lets processes continue to run because the files they need will not be removed until after the process terminates. This creates a semantic in which a currently running program can be using versions of files no longer available to other programs. Existing package managers use this semantic to allow a system to be upgraded online, and it is widely understood. Strata applies the same semantic to layers. When a layer is removed from a VLFS, Strata marks the layer as unlinked, removing it from the file system namespace. Although this layer is no longer part of the file system Chapter 6. Strata: Managing Large Numbers of Machines 122 namespace and thus cannot be used by any operations such as open that work on the namespace, it does remain part of the VLFS, enabling data operations such as read and write to continue working correctly for previously opened files. 6.4 Improving Appliance Security In today’s world, machines are continually attacked and administrators work hard to deflect the attacks. 
But even with an administrator’s best efforts, attacks still succeed from time to time. A main problem in dealing with possibly compromised machines is detecting whether they have indeed been compromised. Just because an attack is detected does not mean that the attacker was able to change the machine in a persistent way. Many administrators employ additional tools such as Tripwire [79] to aid in this effort, but this creates an added burden. There are extra tools and databases to be maintained and possibly neglected. This leaves the administrators not always knowing what, or if, the attacker modified. A clean reinstall is often the best option, but this causes two problems: downtime and lost data. Although an administrator can back up the system before it is reinstalled, this further adds to the time lost to repairs. To address these problems, Strata not only manages appliances, but also keeps them more secure, improves compromise detection, and makes it easier to fix compromised machines. Strata does this in three fundamental ways. First, many machines are exploited because they provide functionality that is not needed and therefore not maintained appropriately. Strata improves auditing by allowing an administrator to examine each VLFS configuration to determine if unneeded layers, and therefore pieces of software, are being included. As opposed to a traditional monolithic file system, where files can become hidden among their peers, a VLFS enables an Chapter 6. Strata: Managing Large Numbers of Machines 123 administrator to determine easily which layers are included and isolate file system modifications stored in the private read-write layer. Similarly, in the face of an attempted compromise, the VLFS lets an administrator determine quickly if the file system has been compromised simply by checking the file system’s private layer. Because any changes made to the file system cause a change to the private read-write layer, an administrator can see if any system binaries or libraries have been copied up to the private layer. If this has occurred, the administrator knows that the system has been maliciously modified. The attacker has no ability to modify the shared read-only layers because the layer repository’s file system share enforces the read-only access to the shared contents. To modify the contents in the shared layer repositories, an attacker would have to find a way to attack the file system share itself. Although the attacker can still modify the appliance’s file system, administrators can easily tell that this has happened by noticing the system files stored within the VLFS’s private read-write layer. Administrators can detect these modifications without relying on external databases that have to be maintained separately and updated whenever the file system is changed. Second, by leveraging Strata’s layer concept, an administrator can deploy fixes to all of the machines more quickly, without having to worry about machines not currently running or forgotten altogether. When a layer update is available to fix a security hole, an administrator needs only to import it into the local layer repository. Systems managed by Strata will detect that the layer repository has been updated and identify that updates are available for a layer that is being used in the local VLFS. Strata will automatically include the new layer into the VLFS’s namespace while removing the old one. Finally, with a VLFS, it is simple to recreate a fresh system. 
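Before turning to recovery, the detection step just described can be sketched concretely: because every change to a Strata-managed appliance lands in its private read-write layer, checking for tampering reduces to scanning that single layer for files that shadow or add to the paths supplied by the shared read-only layers. The directory layout, the notion of "system prefixes," and the function name below are assumptions for illustration, not part of Strata's tooling.

```python
import os

SYSTEM_PREFIXES = ("bin/", "sbin/", "lib/", "usr/")   # illustrative notion of "system paths"

def suspicious_private_entries(private_root, shared_paths):
    """Report files in the private read-write layer that shadow or add system files.

    private_root : directory backing the VLFS's private layer
    shared_paths : set of relative paths provided by the shared read-only layers
    """
    findings = []
    for dirpath, _, filenames in os.walk(private_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), private_root)
            if rel.startswith(SYSTEM_PREFIXES):
                reason = ("copied up and modified" if rel in shared_paths
                          else "new system file not provided by any layer")
                findings.append((rel, reason))
    return findings

# Any hit means a system binary or library was changed or injected after deployment,
# detected without maintaining an external checksum database.
```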
By replacing the compromised private layer with a fresh layer, the system is instantly cleaned. This is Chapter 6. Strata: Managing Large Numbers of Machines 124 equivalent to deploying a new virtual appliance, as the private layer is what distinguishes virtual appliance clones. As opposed to physical systems, where reinstalling the system can require overwriting the compromised system, cleaning a system with Strata does not require losing the contents of the compromised machine. Because cleaning the system does not require getting rid of the compromised private layer, an administrator need not waste time backing it up and can make it available within the appliance’s file system as a regular directory without it being composed into the normal file system view. This can puts the system back online quickly while also allowing easy import of data to be preserved from the compromised system. Quickly fixing compromised systems is useful, but often results in discarding the authorized configuration changes made to that system. Until now, we have described a single VLFS containing multiple read-only layers shared among appliances and one read-write layer containing the virtual appliance’s private data. But the appliance’s private data need not be limited to a single layer. An end user of a deployed appliance can create their own configuration layers to lock in whatever persistent configuration changes they desire. Regular configuration layers are read-only and shared between appliances, but this configuration layer is read-only and accessible only to the local appliance. In practice, the end user will initially create a VLFS as described above that has only one read-write layer for private data. Configuration changes are usually done at the outset and remain static for an extended period, so static configuration changes can be confined to this private layer. When the user is satisfied with the configuration, they convert the read-write private layer to a read-only configuration layer to lock it in, while adding a new private layer to contain the file system changes that occur during regular usage. If the machine’s configuration is corrupted due to system compromise or an administrator’s authorized changes, the user can quickly revert back to the locked down configuration, kept as it is on a read-only layer. Chapter 6. Strata: Managing Large Numbers of Machines 6.5 125 Experimental Results We have implemented Strata as a loadable kernel module on an unmodified Linux 2.6 series kernel. The loadable kernel module implements Strata’s VLFS as a stackable file system. We present experimental results using our Linux prototype to manage various VAs, demonstrating its ability to reduce management costs while incurring only modest performance overhead. Experiments were conducted on VMware ESX 3.0 running on an IBM BladeCenter with 14 IBM HS20 eServer blades with dual 3.06 GHz Intel Xeon CPUs, 2.5 GB RAM, and a Q-Logic Fibre Channel 2312 host bus adapter connected to an IBM ESS Shark SAN with 1 TB of disk space. The blades were connected by a gigabit Ethernet switch. This is a typical virtualization infrastructure in an enterprise computing environment where all virtual machines are centrally stored and run. We compare plain Linux VMs with a virtual block device stored on the SAN and formatted with the Ext3 file system to VMs managed by Strata with the layer repository also stored on the SAN. 
By storing both the plain VM's virtual block device and Strata's layers on the SAN, we eliminate any differences in performance due to hardware architecture. To measure management costs, we quantify the time taken by two common tasks, provisioning and updating VAs. We quantify the storage and time costs for provisioning many VAs and the performance overhead for running various benchmarks using the VAs. We ran experiments on five VAs: an Apache web server, a MySQL database server, a Samba file server, an SSH server providing remote access, and a remote desktop server providing a complete GNOME desktop environment. While the server VAs have relatively few layers, the desktop VA has very many layers. This enables the experiments to show how VLFS performance scales as the number of layers increases. To provide a basis for comparison, we provisioned these VAs using both the normal VMware virtualization infrastructure with plain Debian package management tools, and Strata. To make a conservative comparison to plain VAs and to test larger numbers of plain VAs in parallel, we minimized the disk usage of the VAs. The desktop VA used a 2 GB virtual disk, while all others used a 1 GB virtual disk.

6.5.1 Reducing Provisioning Times

Table 6.1 shows how long it takes Strata to provision VAs versus regular and COW copying. To provision a VA using Strata, Strata copies a default VMware VM with an empty sparse virtual disk and provides it with a unique MAC address. It then creates a symbolic link on the shared file system from a file named by the MAC address to the layer definition file that defines the configuration of the VA. When the VA boots, it accesses the file denoted by its MAC address, mounts the VLFS with the appropriate layers, and continues execution from within it; a sketch of this flow appears below. To provision a plain VA using regular methods, we use QEMU's qemu-img tool to create both raw copies and COW copies in the QCOW2 disk image format. Our measurements for all five VAs show that using COW copies and Strata takes about the same amount of time to provision VAs, while creating a raw image takes much longer. Creating a raw image for a VA takes 3 to almost 6 minutes and is dominated by the cost of copying data to create a new instance of the VA. For larger VAs, these provisioning times would only get worse. In contrast, Strata provisions VAs in only a few milliseconds because a null VMware VM has essentially no data to copy. Layers do not need to be copied, so copying overhead is essentially zero. While COW images can be created in a similar amount of time, they do not provide any of the management benefits of Strata, as each new COW image is independent of the base image from which it was created.

         Apache   MySQL    Samba    SSH      Desktop
Plain    184s     179s     183s     174s     355s
Strata   0.002s   0.002s   0.002s   0.002s   0.002s
QCOW2    0.003s   0.003s   0.003s   0.003s   0.003s

Table 6.1 – VA Provisioning Times
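The Strata provisioning flow just described amounts to copying an essentially empty VM plus creating one symbolic link. The sketch below illustrates it; the paths, the MAC-generation helper, and the by-mac link directory are invented for this example and do not reflect the actual prototype's layout.

```python
import os, subprocess, uuid

SHARED_FS = "/srv/strata"   # hypothetical shared file system holding templates and layer definitions

def random_mac():
    # locally administered MAC address; the generation scheme is purely illustrative
    return "02:" + ":".join(f"{b:02x}" for b in uuid.uuid4().bytes[:5])

def provision_va(layer_definition, template="null-vm"):
    """Provision a Strata-managed VA: copy an empty template VM, give it a unique
    MAC address, and point that MAC address at a layer definition file."""
    mac = random_mac()
    vm_dir = os.path.join(SHARED_FS, "vms", mac.replace(":", ""))
    # 1. copy the default VM with an empty sparse virtual disk (essentially no data to copy)
    subprocess.run(["cp", "-a", os.path.join(SHARED_FS, "templates", template), vm_dir], check=True)
    # 2. record the MAC -> configuration mapping as a symbolic link on the shared file system
    os.symlink(os.path.join(SHARED_FS, "definitions", layer_definition),
               os.path.join(SHARED_FS, "by-mac", mac))
    return mac

# At boot the VA reads the file named by its own MAC address, mounts a VLFS composed of
# the layers listed there, and continues execution from within it; no layers are copied.
```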
6.5.2 Reducing Update Times

Table 6.2 shows how long it takes to update VAs using Strata versus traditional package management. We provisioned ten VA instances each of Apache, MySQL, Samba, SSH and Desktop for a total of 50 provisioned VAs. All were kept in a suspended state. When a security patch [146] became available for the tar package, which was installed in all the VAs, we updated them. Strata simply updates the layer definition files of the VM templates, which it can do even when the VAs are not active. When a VA is later resumed during normal operation, it automatically checks whether its layer definition file has been updated and updates the VLFS namespace view accordingly, an operation that is measured in microseconds. To update a plain VA using normal package management tools, each VA instance must be resumed and must acquire a network address. An administrator or script must then ssh into each VA, fetch and install the updated packages, and finally re-suspend the VA.

         Wake     Network   Update   Suspend   Total
Plain    14.66s   43.72s    10.22s   3.96s     73.2s
Strata   NA       NA        1.041s   NA        1.041s

Table 6.2 – VA Update Times

Table 6.2 shows the average time to update each VA using traditional methods versus Strata. We break down the update time into the times to resume the VM, get access to the network, actually perform the update, and re-suspend the VA. The measurements show that the cost of performing an update is dominated by the management overhead of preparing the VAs to be updated. Preparation is itself dominated by getting an IP address and becoming accessible on a busy network. While this cost is not excessive on a quiet network, on a busy network it can take a significant amount of time for the client to get a DHCP address, and for the ARP tables on the machine controlling the update to find the target machine. In our test, the average total time to update each plain VA is about 73 seconds. In contrast, Strata takes only about a second to update each VA. As this is an order of magnitude shorter even than resuming the VA, Strata can defer the update until the VA is next resumed from standby, without impacting its ability to respond quickly. Strata provides over 70 times faster updates than traditional package management when managing even a modest number of VAs, and this advantage only grows as the number of VAs being managed increases.

6.5.3 Reducing Storage Costs

Figure 6.6 shows the total storage space required for different numbers of VAs stored with raw and COW disk images versus Strata. We show the total storage space for 1 Apache VA, for 5 VAs corresponding to an Apache, MySQL, Samba, SSH, and Desktop VA, and for 50 VAs corresponding to ten instances of each of the five VAs. As expected, for raw images, the total storage space required grows linearly with the number of VA instances. In contrast, the total storage space using COW disk images and Strata is relatively constant and largely independent of the number of VA instances. For one VA, the storage space required for the disk image is less than the storage space required for Strata, as the layer repository used contains more layers than those used by any one of the VAs. In fact, to run a single VA, the layer repository could be trimmed down to the same size as the traditional VA.

[Figure 6.6 – Storage Overhead: total storage in MB (log scale) for 1, 5 and 50 VMs, comparing plain VM disk images with Strata.]

For larger numbers of VAs, however, Strata provides a substantial reduction in the storage space required, because VAs share layers and do not require duplicate storage. For 50 VAs, Strata reduces the storage space required by an order of magnitude over the raw disk images. Table 6.3 shows that there is much duplication among the VAs, as the layer repository of 405 distinct layers needed to build the different VLFSs for multiple services is basically the same size as the largest service.
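A back-of-the-envelope calculation using only the figures already reported (1 GB virtual disks for the server VAs, 2 GB for the desktop VA, and a 1.8 GB layer repository) shows why the gap widens with scale. Per-VA private layers are assumed to be negligible and are ignored; the numbers below are illustrative, not additional measurements.

```python
server_disk_gb, desktop_disk_gb = 1, 2   # raw virtual disk sizes used in the experiments
repo_gb = 1.8                            # shared layer repository size (Table 6.3)

# 50 VAs = 10 instances each of Apache, MySQL, Samba and SSH (1 GB) plus 10 Desktop VAs (2 GB)
raw_total_gb = 40 * server_disk_gb + 10 * desktop_disk_gb   # 60 GB of duplicated raw images
strata_total_gb = repo_gb                                   # layers stored once and shared by every VA

print(raw_total_gb / strata_total_gb)   # roughly 30x, consistent with the order-of-magnitude gap in Figure 6.6
```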
Although initially Strata does not have a significant storage benefit over COW disk images, each COW disk image is independent of the image from which it was created and must therefore be managed independently. This increases storage usage over time, as the same updates must be applied separately to many independent disk images. While other mechanisms, such as deduplication, can help with storage usage, they increase overhead due to the effort required to find duplicates. Moreover, deduplication does not help with the management of the individual VAs, as updates still have to be applied to each system independently.

Layer repository (all layers): 1.8GB

           Size     # Layers   Shared    Unique
Apache     217MB    43         191MB     26MB
MySQL      206MB    23         162MB     44MB
Samba      169MB    30         152MB     17MB
SSH        127MB    12         123MB     4MB
Desktop    1.7GB    404        169MB     1.6GB

Table 6.3 – Layer Repository vs. Static VAs

6.5.4 Virtualization Overhead

To measure the virtualization cost of Strata's VLFS, we used a range of microbenchmarks and real application workloads to measure the performance of our Linux Strata prototype, then compared the results against vanilla Linux systems within a virtual machine. The virtual machine's local file system was formatted with the Ext3 file system and given read-only access to a SAN partition also formatted with Ext3. We performed all benchmarks in every scenario described above. To demonstrate the effect that Strata's VLFS has on system performance, we performed a number of benchmarks. Postmark [76] is a synthetic test that measures how the system would behave if used as a mail server. Our Postmark test operated on files between 512 bytes and 10 KB to simulate the mail server's spool directory, with an initial set of 20,000 files, and performed 200,000 transactions. Postmark is very intensive on a few specific file system operations such as lookup(), create() and unlink(), because it is constantly creating, opening and removing files. Figure 6.7 shows that running this benchmark within a traditional VA is significantly faster than running it in Strata. This is because Strata composes multiple file system namespaces together, which places significant overhead on namespace operations such as lookup().

[Figure 6.7 – Postmark Overhead in Multiple VAs: Postmark completion time in seconds for each VA, comparing a plain VM with Strata.]

To demonstrate that Postmark's results are not indicative of performance in real-life scenarios, we ran two application benchmarks to measure the overhead Strata imposes in desktop and server VA scenarios. First, we timed a multi-threaded build of the Linux 2.6.18.6 kernel with two concurrent jobs using the VM's two CPUs. In all scenarios, we added the layers required to build a kernel to the layers needed to provide the service, generally adding 8 additional layers to each case. Figure 6.8 shows that while Strata imposes a slight overhead on the kernel build compared to the underlying file system it uses, the cost is minimal, under 5% at worst.

[Figure 6.8 – Kernel Build Overhead in Multiple VAs: kernel build time in seconds for each VA, comparing a plain VM with Strata.]

Second, we measured the number of HTTP transactions completed per second by an Apache web server placed under load. We imported the database of a popular guitar tab search engine and used the http_load [108] benchmark to continuously perform a set of 20 search queries on the database for 60 seconds.
For each case that did not already contain Apache, we added the appropriate layers to the layer definition file to make Apache available. Figure 6.9 shows that Strata imposes a minimal overhead of only 5%.

[Figure 6.9 – Apache Overhead in Multiple VAs: HTTP fetches per second for each VA, comparing a plain VM with Strata.]

6.6 Related Work

The most common way to provision and maintain machines today is to use the package management system built into the operating system [4, 56]. Package managers view the file system into which they install packages as a simple container for files, not as a partner in the management of the machine. This causes them to suffer from a number of flaws when managing large numbers of VAs. They are not space- or time-efficient, as each provisioned VA needs an independent copy of each package's files and requires time-consuming copying of many megabytes or gigabytes into each VA's file system. These inefficiencies affect both provisioning and updating of a system, because a lot of time is spent downloading, extracting and installing the individual packages into the many independent VAs. Because the package manager does not work in partnership with the file system, the file system is unable to distinguish the different types of files it contains. A file installed from a package and a file modified or created in the course of usage are indistinguishable. Specialized tools are needed to traverse the entire file system to determine whether a file belongs to a package or was created or modified after the package was installed. For instance, to determine if a VA has been compromised, an administrator must determine whether any system files have been modified. Finally, package management systems work in the context of a running system and modify the file system directly. These standard tools often do not work outside the context of a running system, for example, for a VA that is suspended or turned off.

For local scenarios, the space and time efficiency of provisioning a VA can be improved by using copy-on-write (COW) disks, such as QEMU's QCOW2 [91] format. These enable VAs to be provisioned quickly, as little data has to be written to disk immediately due to the COW property. However, once provisioned, each COW copy is fully independent of the original, is equivalent to a regular copy, and therefore suffers from all the same maintenance problems as a regular VA. Even if the original disk image is updated, the changes would be incompatible with the cloned COW images. This is because COW disks operate at the block level. As files get modified, they use different blocks on their underlying device. Therefore, it is likely that the original and cloned COW images address the same blocks for different pieces of data. For similar reasons, COW disks do not help with VA creation, as multiple COW disks cannot be combined into a single disk image. Both the Collective [41] and Ventana [103] attempt to solve the VA maintenance problem by building upon COW concepts. Both systems enable VAs to be provisioned quickly by performing a COW copy of each VA's system file system.
However, they suffer from the fact that they manage this file system at either the block device or monolithic file system level, providing users with only a single file system. While ideally an administrator could supply a single homogeneous shared image for all users, in practice, users want access to many heterogeneous images that must be maintained independently and therefore increase the administrator’s work. The same is true for VAs provisioned by the end user, while they both enable the VAs to maintain a separate disk from the shared system disk that persists beyond upgrades. Mirage [121] attempts to improve the disk image sprawl problem by introducing a new storage format, the Mirage Index Format (MIF), to enumerate what files belong to a package. However, it does not help with the actual image sprawl in regard to machine maintenance, because each machine reconstituted by Mirage still has a fully independent file system, as each image has its own personal copy. Although each provisioned machine can be tracked, they are now independent entities and suffer from the same problems as a traditional VA. Stork [38] improves on package management for container-based systems by enabling containers to hard link to an underlying shared file system so that files are only Chapter 6. Strata: Managing Large Numbers of Machines 135 stored once across all containers. By design, it cannot help with managing independent machines, virtual machines, or VAs, because hard links are a function internal to a specific file system and not usable between separate file systems. Union file systems [102, 150] provide the ability to compose multiple different file system namespaces into a single namespace view. Unioning file systems are commonly used to provide a COW file system from a read-only copy, such as with LiveCDs. However, unioning file system by themselves do not directly help with VA management, as the underlying file system has to be maintained using regular tools. Strata builds upon and leverages this mechanism by improving its ability to handle deleted files as well as managing the layers that belong to the union. This allows Strata to provide a solution that enables efficient provisioning and management of VAs. Chapter 7 Apiary: A Desktop of Isolated Applications In today’s world of highly connected computers, desktop security and privacy are major issues. Desktop users interact constantly with untrusted data they receive from the Internet by visiting new websites, downloading files and emailing strangers. All these activities use information whose safety the user cannot verify. Data can be constructed maliciously to exploit bugs and vulnerabilities in applications, enabling attackers to take control of users’ desktops. For example, a major flaw was recently discovered in Adobe Acrobat products that enables an attacker to take control of a desktop when a maliciously constructed PDF file is viewed [18]. Adobe’s estimate to release a fix was nearly a month after the exploit was released into the wild. Even in the absence of bugs, untrusted data can be constructed to invade users’ privacy. For example, cookies are often stored when visiting websites that allow advertisers to track user behavior across multiple websites. The prevalence of untrusted data and buggy software makes application fault containment increasingly important. Many approaches have been proposed to isolate Chapter 7. 
Apiary: A Desktop of Isolated Applications 137 applications from one another using mechanisms such as process containers [7, 116] or virtual machines [147]. For instance, in Chapter 5, we introduced PeaPod to leverage process containers to isolate the components of a single application. Faults are confined so that if an application is compromised, only that application and the data it can access are available to an attacker. By having only one application percontainer, each individual container becomes a simpler system, making it easier to determine if unwanted processes are running within it. However, existing approaches to isolating applications suffer from an unresolved tension between ease of use and degree of fault containment. Some approaches [72,92] provide an integrated desktop feel but only partial isolation. They are relatively easy to use, but do not prevent vulnerable applications from compromising the system itself. Other approaches [122, 143] have less of an integrated desktop feel but fully isolate applications into distinct environments, typically by using separate virtual machines. These approaches effectively limit the impact of compromised applications, but are harder to use because users are forced to manage multiple desktops. Virtual machine (VM) approaches also require managing multiple machine instances and incur high overhead to support multiple operating system instances, making them too expensive to allow more than a couple of fault containment units per-desktop. To address these problems, we introduce Apiary, which provides strong isolation for robust application fault containment while retaining the integrated look, feel and ease of use of a traditional desktop environment. Apiary accomplishes this by using well-understood technologies like thin clients, operating system containers and unioning file systems in novel ways. It does this using three key mechanisms. First, it decomposes a desktop’s applications into isolated containers. Each container is an independent software appliance that provides all system services an application needs to execute. To retain traditional desktop semantics, Apiary integrates Chapter 7. Apiary: A Desktop of Isolated Applications 138 these containers in a controlled manner at the display and file system. Apiary’s containers prevent an exploit from compromising the user’s other applications. For example, by having separate web browser and personal finance containers, any compromise from web browsing would not be able to access personal financial information. At the same time, Apiary makes the web browser and personal finance containers look and feel like part of the same integrated desktop, with all normal windowing functions and cut-and-paste operations operating seamlessly across containers. Second, it introduces the concept of ephemeral containers. Ephemeral containers are execution environments with no access to user data that are quickly instantiated from a clean state for only a single application execution. When the application terminates, the container is archived, but never used again. Apiary uses ephemeral containers as a fundamental building block of the integrated desktop experience while preventing contamination across containers. For example, users often expect to view PDF documents from the web, but need separate web browser and PDF viewer containers for fault containment. 
If a user always views PDF documents in the same PDF viewer container, a single malicious document could exploit the container and have access to future documents the user wants to keep private, like bills and bank statements. Instead, Apiary enables the web browser to automatically instantiate a new ephemeral PDF viewer container for each individual PDF document. Even if the PDF file is malicious, it will have no effect on other PDF files because the container instance it exploited will never be used again. As illustrated by this PDF example, ephemeral containers have three benefits. First, they prevent compromises, because exploits, even if triggered, cannot persist. Second, they protect users from compromised applications. Even when an application has been compromised, a new ephemeral container running that application in parallel will remain uncompromised because it is guaranteed to start from a clean state. Chapter 7. Apiary: A Desktop of Isolated Applications 139 Third, they help protect user privacy when using the Internet. For example, while cookies must be accepted to use many websites, web browsers in separate ephemeral containers can be used for different websites to prevent cookies from tracking user behavior across websites. Apiary’s third mechanism is Strata’s VLFS. Apiary leverages the VLFS to allow the many application containers used in Apiary to be efficiently stored and instantiated. Since each container’s VLFS will share the layers that are common to them, Apiary’s storage requirements are the same as a traditional desktop. Similarly, since no data has to be copied to create a new VLFS instance, Apiary is able to quickly instantiate ephemeral containers for a single application execution. Apiary’s approach differs markedly from the approach taken by PeaPod in Chapter 5. In PeaPod, we isolate the different process components of a single larger application, such as an email server. These applications contain processes that require access to large amounts of the same data, but with differing levels of privilege and therefore they cannot be fully isolated. Furthermore, in many of these applications, the security model is well understood and therefore simple sets of rules can be created to isolate each component. However, desktop security is much more complicated. As can be seen in Chapter 5.4.3, just isolating one small portion of the desktop involved the creation of the largest set of rules. In Apiary, we enable the isolation of desktop applications without any rules. 7.1 Apiary Usage Model Figure 7.1 shows the Apiary desktop. It looks and works like a regular desktop. Users launch programs from a menu or from within other programs, switch among launched programs using a taskbar, interact with running programs using the keyboard and Chapter 7. Apiary: A Desktop of Isolated Applications 140 Figure 7.1 – Apiary screenshot showing a desktop session. At the the topmost left is (1), an application menu that provides access to all available applications. Just below it, the window list (2) allows users to easily switch among running applications. (3) is the composite display view of all the visible running applications. mouse, and have a single display with an integrated window system and clipboard functionality that contains all running programs. Although Apiary provides a look and feel similar to a regular desktop, it provides fault containment by isolating applications into separate containers. Containers enforce isolation so that applications running inside cannot get out. 
Apiary isolates individual applications, not individual programs. An application in Apiary can be understood as a software appliance made up of multiple programs used together in a single environment to accomplish a specific task. For instance, a user’s web browser and word processor would be considered separate applications and isolated from one another. The software appliance model means that users can install separate isolated Chapter 7. Apiary: A Desktop of Isolated Applications 141 applications containing many or all of the same programs, but used for different purposes. For example, a banking application contains a web browser for accessing a bank’s website, while a web surfing application also contains a web browser, but for general web browsing. Both appliances make use of the same web browser program, but are listed as different applications in the application menu. Apiary provides two types of containers: ephemeral and persistent. Ephemeral containers are created fresh for each application execution. Persistent containers, like a traditional desktop, maintain their state across application executions. Apiary lets users select whether an application should launch within an ephemeral or a persistent container. Windows belonging to ephemeral applications are, by default, given distinct border colors so that users can quickly identify based on appearance in which mode an application is executing. Ephemeral containers provide a powerful mechanism for protecting desktop security and user privacy when running common desktop operations, such as viewing untrusted data, that do not require storing persistent states. Users will typically run multiple ephemeral containers at the same time, and, in some cases, multiple ephemeral containers for the same application at the same time. They provide important benefits for a wide range of uses. Ephemeral containers prevent compromises because exploits cannot persist. For example, a malicious PDF document that exploits an ephemeral PDF viewer will have no persistent effect on the system because the exploit is isolated in the container and will disappear when the container finishes executing. Ephemeral containers protect user privacy when using the Internet. For example, many websites require cookies to function, but also store advertisers’ cookies to track user behavior across websites and compromise privacy. Apiary makes it easy to use multiple ephemeral web browser containers simultaneously, each with separate Chapter 7. Apiary: A Desktop of Isolated Applications 142 cookies, making it harder to track users across websites. Ephemeral containers protect users from compromises that may have already occurred on their desktop. If a web browser has been compromised, parallel and future uses of the web browser will allow an attacker to steal sensitive information when the user accesses important websites (e.g., for banking). Ephemeral containers are guaranteed to launch from a clean slate. By using a separate ephemeral web browser container for accessing a banking site, Apiary ensures that an already exploited web browser installation cannot compromise user privacy. Ephemeral containers allow applications to launch other applications safely. For example, users often receive email attachments such as PDF documents that they wish to view. To avoid compromising an email container, Apiary creates a separate ephemeral PDF viewer container for the PDF. 
Even if it is malicious, it will have no effect on the user’s desktop, as it only affects the isolated ephemeral container. Similarly, ephemeral word processor or spreadsheet containers will be created for viewing these email attachments to prevent malicious files from compromising the system. In general, Apiary allows applications to cause other applications to be safely launched in ephemeral containers by default to support scenarios that involve multiple applications. Isolated persistent containers are necessary for applications that maintain state across executions to prevent a single application compromise from affecting the entire system. Users typically run one persistent container per-application to avoid needing to track which persistent application container contains which persistent information. Some applications only run in persistent containers, while others may run in both types of containers. For example, an email application is typically used in a persistent container to maintain email state across executions. On the other hand, a web browser will be used both in a persistent container, to access a user’s trusted websites, and in Chapter 7. Apiary: A Desktop of Isolated Applications 143 an ephemeral container, to view untrusted websites. Similarly, a browser may be used in a persistent container to remember browsing history, plugins and bookmarks, but may also be used in an ephemeral container when accessing untrusted websites. Note that files stored in both kinds of containers are private by default and not accessible outside their container. Apiary’s containers work together to provide a security system that differs fundamentally from common security schemes that attempt to lock down applications within a restricted-privilege environment. In Apiary, each application container is an independent entity that is entirely isolated from every other application container on the Apiary desktop. One does not have to apply any security analysis or complex isolation rules to determine which files a specific application should be able to access. Also, in most other schemes, an application, once exploited, will continue to be exploited, even if the exploited application is restricted from accessing other applications’ data. Apiary’s ephemeral containers, however, prevent an exploit from persisting between application execution instances. Apiary provides every desktop with two ways to share files between containers. First, containers can use standard file system share concepts to create directories that can be seen by multiple containers. This has the benefit of allowing any data stored in the shared directory to be automatically available to the other containers that have access to the share. Second, Apiary supplies every desktop with a special persistent container with a file explorer. The explorer has access to all of the user’s containers and can manage all of the user’s files, including copying them between containers. This is useful if a user decides they want to preserve a file from an ephemeral container, or move a file from one persistent container to another, as, for instance, when emailing a set of files. The file explorer container cannot be used in an ephemeral manner, its functionality cannot be invoked by any other application on Chapter 7. Apiary: A Desktop of Isolated Applications 144 the system, and no other application is allowed to execute within it. This prevents an exploited container from using the file explorer container to corrupt others. 
Note that both of these mechanisms break the isolation barrier that exists between containers. File system shares can be used by an exploited container as a vector to infect other containers, while a user can be tricked into moving a malicious file between containers. However, this is a tension that will always exist in security systems that are meant to be usable by a diverse crowd of users. 7.2 Apiary Architecture To support its container model, Apiary must have four capabilities. First, Apiary must be able to run applications within secure containers to provide application isolation. Second, Apiary must provide a single integrated display view of all running applications. Third, Apiary must be able to instantiate individual containers quickly and efficiently. Finally, for a cohesive desktop experience, Apiary must allow applications in different containers to interact in a controlled manner. Apiary does this by using a virtualization architecture that consists of three main components: an operating system container that provides a virtual execution environment, a virtual display system that provides a virtual display server and viewer and the VLFS. Additionally, Apiary provides a desktop daemon that runs on the host. This daemon instantiates containers, manages their lifetimes and ensures that they are correctly integrated. 7.2.1 Process Container Apiary’s containers are essential to Apiary’s ability to isolate applications from one another. By providing isolated containers, individual applications can run in parallel Chapter 7. Apiary: A Desktop of Isolated Applications 145 within separate containers, and have no conception that there are other applications running. This enforces fault containment, as an exploited process will only have access to whatever files are available within its own container. Apiary’s containers leverage features such as Solaris’s zones [116], FreeBSD’s jails [74] and Linux’s containers [6] to create isolated execution environments. Each container has its own private kernel namespace, file system and display server, providing isolation at the process, file system and display levels. Programs within separate containers can only interact using normal network communication mechanisms. In addition, each container has an application control daemon that enables the virtual display viewer to query the container for its contents and interact with it. 7.2.2 Display Apiary’s virtual display system is crucial to complete process isolation and a cohesive desktop experience. If containers were to share a single display directly, malicious applications could leverage built-in mechanisms in commodity display architectures [61, 93] to insert events and messages into other applications that share the display, enabling the malicious application to remotely control the others, effectively exploiting them as well. Many existing commodity security systems do not isolate applications at the display level, providing an easy vector for attackers to further exploit applications on the desktop. But although independent displays isolate the applications from one another, they do not provide the single cohesive display users expect. This cohesive display has two elements. First, the display views have to be integrated into a single view. Second, Apiary has to provide the normal desktop metaphors that users want, including a single menu structure for launching applications and an integrated task switcher that Chapter 7. 
Apiary: A Desktop of Isolated Applications 146 allows the user to switch among all running applications. Apiary’s virtual display system incorporates both of these elements. First, Apiary’s virtual display provides each container with its own virtual display similar to existing systems [14, 27, 51, 140]. This virtual display operates by decoupling the display state from the underlying hardware and enabling the display output to be redirected anywhere. Second, Apiary enables these independent displays to be integrated into a single display view. While a regular remote framework provides all the information needed to display each desktop, it assumes that there is no other display in use, and therefore expects to be able to draw the entire display area. In Apiary, where multiple containers are in use, this assumption does not hold. Therefore, to enable multiple displays to be integrated into a single view, the Apiary viewer composes the display together using the Porter-Duff [107] over operation. Apiary’s viewer provides an integrated menu system that lists all the applications users are able to launch. Apiary leverages the application control daemon running within each container to enumerate all the applications within the container, much like a regular menu in a traditional desktop. Instead of providing the menu directly in the screen, however, it transmits the collected data back to the viewer, which then integrates this information into its own menu, associating the menu entry with the container it came from. When a user selects a program from the viewer’s menu, the viewer instructs the correct daemon to execute it within its container. Similarly, to manage running applications effectively, Apiary provides a single taskbar with which the user can switch between all applications running within the integrated desktop. Apiary leverages the system’s ability to enumerate windows and switch applications [63] by having the daemon enumerate all the windows provided by its container and transmit this information to the viewer. The viewer then integrates Chapter 7. Apiary: A Desktop of Isolated Applications 147 this information into a single taskbar with buttons corresponding to application windows. When the user switches windows using the taskbar, the viewer communicates with the daemon and instructs it to bring the correct window to the foreground. Note that by stacking the independent displays, the windowing semantic is changed slightly from a traditional desktop. In a traditional desktop, when one brings a window to the foreground, only that window will be brought up. In Apiary, each display can feature multiple windows, each of which can be raised to the foreground. However, in Apiary, bringing up a window also brings its entire display layer to the foreground. Consequently, all other windows in the display will be raised above the windows provided by all other displays. 7.2.3 File System Apiary requires containers to be efficient in storage space and instantiation time. Containers must be storage-efficient to allow regular desktops to support the large number of application containers used within the Apiary desktop. Containers must be efficiently instantiated to provide fast interactive response time, especially for launching ephemeral containers. Both of these requirements are difficult to meet using traditional independent file systems for each container. 
Each container’s file system would be using its own storage space, which would be inefficient for a large number of containers, as it means many duplicated files. More important, the desktop becomes much harder to maintain because each independent file system must be updated individually. Similarly, instantiating the container requires copying the file system, which can include many megabytes or gigabytes of storage space. Copying time prevents the container from being instantiated quickly. Although file systems that support a branching semantic [32, 103] can be used to quickly provision a new container’s file Chapter 7. Apiary: A Desktop of Isolated Applications 148 system from a template image, each template image will still be independent and therefore inefficient with regard to space, maintenance and upgrades. Apiary leverages Strata’s Virtual Layered File System to meet these requirements. The VLFS enables file systems to be created by composing layers together into a single file system namespace view. VLFSs are built by combining a set of shared software layers together in a read-only manner with a per-container private read-write layer. Multiple VLFSs providing multiple applications are as efficient as a single regular file system because all common files are stored only once in the set of shared layers. Therefore, Apiary is able to store efficiently the file systems its containers need. This also allows Apiary to manage its containers easily. To update every VLFS that uses a particular layer, the administrator need only replace the single layer containing the files that need updating. The VLFS also lets Apiary instantiate each container’s file system efficiently. No data has to be copied into place because each of the software layers is shared in a read-only manner. The instantiation is transparent to the end user and nearly instantaneous. 7.2.4 Inter-Application Integration Apiary provides independent containers for fault containment, but must also ensure that they do not limit effective use of the desktop. For instance, if Firefox is totally isolated from the PDF viewer, how does one view a PDF file? The PDF viewer could be included within the Firefox container, but this violates the isolation that should exist between Firefox and an application viewing untrusted content. Similarly, users could copy the file from the Firefox container to the PDF viewer container, but this is not the integrated feel that users expect. Apiary solves this problem by enabling applications to execute specific applications Chapter 7. Apiary: A Desktop of Isolated Applications 149 in new ephemeral containers. Every application used within Apiary is preconfigured with a list of programs that it enables other applications to use in an ephemeral manner. Apiary refers to these as global programs. For instance, a Firefox container can specify /usr/bin/firefox and a Xpdf container can specify /usr/bin/xpdf as global programs. Program paths marked global exist in all containers. Apiary accomplishes this by populating a single global layer, shared by all the container’s VLFSs, with a wrapper program for each global program. This wrapper program is used to instantiate a new ephemeral container and execute the requested program within it. Apiary only allows for the execution in a new ephemeral container and not in a preexisting persistent or ephemeral container, as that would break Apiary isolation constraints and cannot be done without risk to the preexisting container. 
When executed, the wrapper program determines how it was executed and what options were passed to it. It connects over the network to the Apiary desktop daemon on the same host and passes this information to it. The daemon maintains a mapping of global programs to containers and determines which container is being requested to be instantiated ephemerally. This ensures that only the specified global programs’ containers will be instantiated, preventing an attacker from instantiating and executing arbitrary programs. Apiary is then able to instantiate the correct fresh ephemeral container, along with all the required desktop services, including a display server. The display server is then automatically connected to the viewer. Finally, the daemon executes the program as it was initially called in the new container. To ensure that ephemeral containers are discarded when no longer needed, Apiary’s desktop daemon monitors the process executed within the container. When it terminates, Apiary terminates the container. Similarly, as the Apiary viewer knows which containers are providing windows to it, if it determines that no more windows are being provided by the container, it instructs the desktop daemon to terminate Chapter 7. Apiary: A Desktop of Isolated Applications 150 the container. This ensures that an exploited process does not continue running in the background. Merely running a new program in a fresh container, however, is not enough to integrate applications correctly. When Firefox downloads a PDF and executes a PDF viewer, it must enable the viewer to view the file. This will fail because Firefox and ephemeral PDF viewer containers do not share the same file system. To enable this functionality, Apiary enables small private read-only file shares between a parent container and the child ephemeral container it instantiates. Because well-behaved applications such as Firefox, Thunderbird and OpenOffice only use the system’s temporary file directory to pass files among them, Apiary restricts this automatic file sharing ability to files located under /tmp. To ensure that there are no namespace conflicts between containers, Apiary provides containers with their own private directory under /tmp to use for temporary files, and they are preconfigured to use that directory as their temporary file directory. But providing a fully shared temporary file directory allows an exploited container to access private files that are placed there when passed to an ephemeral container. For instance, if a user downloads a malicious PDF and a bank statement in close succession, they will both exist in the temporary file directory at the same time. To prevent this, Apiary provides a special file system that enhances the read-only shares with an access control list (ACL) that determines which containers can access which files. By default, these directories will appear empty to the rest of the containers, as they do not have access to any of the files. This prevents an exploited container from accessing data not explicitly given to it. A file will only be visible within the directories if the Apiary desktop daemon instructs the file system to reveal that file by adding the container to the file’s ACL. This occurs when a global program’s wrapper is executed and the daemon determines that a file was passed to it as an option. The Chapter 7. Apiary: A Desktop of Isolated Applications 151 daemon then adds the ephemeral container to the file’s ACL. 
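The wrapper-to-daemon handshake described above can be summarized in a short sketch. The socket path, message format, and helper names (instantiate_ephemeral, reveal_file, run_in) are invented for illustration; they are not the actual prototype's interfaces.

```python
#!/usr/bin/env python3
"""Illustrative global-program wrapper: installed at, e.g., /usr/bin/xpdf on the shared
global layer of every container, it forwards the invocation to the Apiary desktop daemon
instead of running the program locally."""
import json, os, socket, sys

DAEMON_SOCKET = "/var/run/apiary.sock"        # assumed rendezvous point with the desktop daemon

def main():
    request = {
        "program": os.path.basename(sys.argv[0]),   # which global program was invoked
        "args": sys.argv[1:],                       # e.g. a /tmp path to a downloaded PDF
        "caller": os.environ.get("APIARY_CONTAINER", "unknown"),
    }
    with socket.socket(socket.AF_UNIX) as s:
        s.connect(DAEMON_SOCKET)
        s.sendall(json.dumps(request).encode())

# Daemon side (sketch): map the program to its application container, start a fresh
# ephemeral instance, reveal only the file that was passed, then run the program there.
def handle_request(req, program_to_container, instantiate_ephemeral, reveal_file, run_in):
    container = program_to_container[req["program"]]    # only registered global programs allowed
    instance = instantiate_ephemeral(container)          # fresh VLFS and display server, no copying
    for arg in req["args"]:
        if arg.startswith("/tmp/"):
            reveal_file(instance, arg)                   # add the ephemeral container to the file's ACL
    run_in(instance, req["program"], req["args"])

if __name__ == "__main__":
    main()
```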
Because the directory structure is consistent between containers, simply executing the requested program in the new ephemeral container with the same options is sufficient. Apiary enables the file explorer container discussed in Section 7.1 in a similar way. The file explorer container is set up like all other containers in Apiary. It is fully isolated from the rest of the containers and users interact with it via the regular display viewer. It differs from the rest of the containers in that other containers are not fully isolated from it. This is necessary as users can store their files in multiple locations, most notably, the container’s /tmp directory and the user’s home directory. Apiary’s file explorer provides read-write access to each of these areas as a file share within the file explorer’s FS namespace. Apiary prevents any executable located within these file systems from executing with the file explorer container to prevent malicious programs from exploiting it. Users are able to use normal copy/paste semantics to move files among containers. While this is more involved than a normal desktop with only a single namespace, users generally do not have to move files among containers. The primary situation in which users might desire to move files between containers is when interacting with an ephemeral container, as a user might want to preserve a file from there. For instance, a user can run their web browser in an ephemeral container to maintain privacy, but also download a file they want to keep. While the ephemeral container is active, a user can just use the file explorer to view all active containers. To avoid situations where the user only remembers after terminating the ephemeral container that it had files they wanted to keep, Apiary archives all newly created or modified non-hidden files that are accessible to the file explorer when the ephemeral container terminates. This allows a user to gain access to them even after the ephemeral container has terminated. Apiary automatically trims this archive if Chapter 7. Apiary: A Desktop of Isolated Applications 152 no visible data was stored within the ephemeral container, such as in the case of an ephemeral web browser that the user only used to view a web page, and did not save a specific file. Similarly, Apiary provides the user the ability to trim the archive to remove ephemeral container archives that do not contain data they need. Apiary also turns the desktop viewer into an inter-process communication (IPC) proxy that can enable IPC states to be shared among containers in a controlled and secure manner. This means that only an explicitly allowed IPC state is shared. For example, one of the most basic ways desktop applications share state is via the shared desktop clipboard. To handle the clipboard, each container’s desktop daemon monitors the clipboard for changes. Whenever a change is made to one container’s clipboard, this update is sent to the Apiary viewer and then propagated to all the other containers. The Apiary viewer also keeps a copy of the clipboard so that any future container can be initialized with the current clipboard state. This enables users to continue to use the clipboard with applications in different containers in a manner consistent with a traditional desktop. This model can be extended to other IPC states and operations. 7.3 Experimental Results We have implemented a remote desktop Apiary prototype system for Linux desktop environments. 
The prototype consists of a virtual display driver for the X window system that provides a virtual display for individual containers based on MetaVNC [140], a set of user space utilities that enable container integration and a loadable kernel module for the Linux 2.6 kernel that provides the ability to create and mount VLFSs. Apiary uses a Linux container-like mechanism to provide the isolated containers [100] and the VLFS. Chapter 7. Apiary: A Desktop of Isolated Applications 153 Our prototype’s VLFS layer repository contained 214 layers created by converting the set of Debian packages needed by the set of applications we tested into individual layers. Using these layers, we are able to create per-application appliances for each individual application by simply selecting which high level applications we want within the appliance, such as Firefox, with the dependencies between the layers ensuring that all the required layers are included. Using these appliances, we are able to instantly provision persistent and ephemeral containers for the applications as needed. Using this prototype, we used real exploits to evaluate Apiary’s ability to contain and recover from attacks. We conducted a user study to evaluate Apiary’s ease of use compared to a traditional desktop. We also measured Apiary’s performance with real applications in terms of runtime overhead, startup time and storage efficiency. For our experiments, we compared a plain Linux desktop with common applications installed to an Apiary desktop that has applications available to be used in persistent and ephemeral containers. The applications we used are the Pidgin instant messenger, the Firefox web browser, the Thunderbird email client, the OpenOffice.org office suite, the MPlayer media player and the Xpdf PDF viewing program. Experiments were conducted on an IBM HS20 eServer blade with dual 3.06 GHz Intel Xeon CPUs and 2.5 GB RAM. All desktop application execution occurred on the blade. Participants in the usage study connected to the blade via a Thinkpad T42p laptop with a 1.8 GHz Intel Pentium-M CPU and 2GB of RAM running the MetaVNC viewer. 7.3.1 Handling Exploits We tested two scenarios that illustrate Apiary’s ability to contain and recover from a desktop application exploit, as well as explore how different decisions can affect the security of Apiary’s containers. Chapter 7. Apiary: A Desktop of Isolated Applications 7.3.1.1 154 Malicious Files Many desktop applications have been shown to be vulnerable to maliciously created files that enable an attacker to subvert the target machine and destroy the data. These attacks are prevalent on the Internet, as many users will download and view whatever files are sent to them. To demonstrate this problem, we use two malicious files [62, 64] that exploit old versions of Xpdf and mpg123 respectively. The mpg123 program was stored within the MPlayer container. The mpg123 exploit works by creating an invalid mp3 file that triggers a buffer overflow in old versions of mpg123, enabling the exploit to execute any program it desires. The Xpdf exploit works by exploiting a behavior of how Xpdf launched helper programs, that is, by passing a string to sh -c. By including a back-tick (‘ ‘) string within a URL embedded in the PDF file, an attacker could get Xpdf to launch unknown programs. Both of these exploits are able to leverage sudo to perform privileged tasks, in this case, deleting the entire file system. 
Sudo is exploited because popular distributions require users to use it to gain root privileges and have it configured to run any applications. Additionally, sudo, by default, caches the user’s credentials to avoid needing to authenticate the user each time it needs to perform a privileged action. However, this enables local exploits to leverage the cached credentials to gain root privileges. In the plain Linux system, recovering from these exploits required us to spend a significant amount of time reinstalling the system from scratch, as we had to install many individual programs, not just the one that was exploited. Additionally, we had to recover a user’s 23GB home directory from backup. Reinstalling a basic Debian installation took 19 minutes. However, reinstalling the complete desktop environment took a total of 50 minutes. Recovering the user’s home directory, which included multimedia files, research papers, email and many other assorted files, took Chapter 7. Apiary: A Desktop of Isolated Applications 155 an additional 88 minutes when transferred over a Gbps LAN. Apiary protected the desktop and enabled easier recovery. It protected the desktop by letting the malicious files be viewed within an ephemeral container. Even though the exploit proceeded as expected and deleted the container’s entire file system, the damage it caused is invisible to the user, because that ephemeral container was never to be used again. Even when we permitted the exploit to execute within a persistent container, Apiary enabled significantly easier recovery from the exploit. As shown in Table 7.2, Apiary can provision a file system in just a few milliseconds. This is nearly 6 orders of magnitude faster than the traditional method of recovering a system by reinstallation. Furthermore, Apiary’s persistent containers divide up home directory content between them, eliminating the need to recover the entire home directory if one application is exploited. This also shows how persistent containers can be constructed in a more secure manner to prevent exploits from harming the user. As a large amount of the above user’s data, such as media files, is only accessed in a read-only manner, the data can be stored on file system shares. This enables the user to allow the different containers to have different levels of access to the share. The file explorer container can access it in a read-write manner, enabling a user to manage the contents of the file system share, while the actual applications that view these files can be restricted to accessing them in a read-only manner, protecting the files from exploits. 7.3.1.2 Malicious Plugins Applications are also exploited via malware that users are tricked into downloading and installing. This can be an independent program or a plugin that integrates with an already-installed application. For example, malicious attackers can try to convince users to download a “codec” they need to view a video. Recently, a malicious Firefox Chapter 7. Apiary: A Desktop of Isolated Applications 156 extension was discovered [31] that leverages Firefox’s extension and plugin mechanism to extract a user’s banking username and password from the browser when the user visits their bank’s website and sends the information to the attacker. These attacks are common because users are badly conditioned to allow a browser to install what it needs when it asks to install something. In a traditional environment, this malicious extension persists until its discovered and removed. 
As it does not affect regular use of the browser, there is very little to alert users that they have been attacked. As this exploit is not readily available to the public, we simulated its presence with the non-malicious Greasemonkey Firefox extension. Much like the malicious file example, Apiary prevented the extension from persisting when installed into an ephemeral container. Even when a user allowed the installation of the extension, it did not persist to future executions of Firefox. However, this exploit poses a significant risk if it enters the user’s persistent web browser container. While one might expect Firefox extensions to be uninstallable through Firefox’s extension manager, this is only true of extensions that are installed through it. If an extension is installed directly into the file system, it cannot be uninstalled this way. Although it can be disabled, it must later be removed from the file system. This applies equally to Apiary and traditional machines. While users can quickly recreate the entire persistent Firefox container, that requires knowing that the installation was exploited. Apiary handles this situation more elegantly by allowing the user to use Firefox in multiple web browsing containers. In this case, we created a general-purpose web browsing container for regular use, as well as a financial web browsing container for the bank website only. Apiary refused to install any addons in the financial web browsing container, keeping it isolated and secure even when the general-purpose web browsing container was compromised. Apiary enables the creation of multiple independent application containers, each Chapter 7. Apiary: A Desktop of Isolated Applications 157 containing the same application, but performing different tasks, such as visiting a bank website. Because the great majority of the VLFS’s layers are shared, the user incurs very little cost for these multiple independent containers. This approach can be extended to other related but independent tasks, for instance, using a media player to listen to one’s personal collection of music, as opposed to listening to Internet radio from an untrusted source. This scenario also reveals a problem with how plugins and other extensions are currently handled. When the browser provides its own package management interface independent of the system’s built-in package manager, this affects impacts Apiary, because certain application extensions might be needed in an ephemeral container, but if they are not known to the package manager, they cannot be easily included. Even today, however, many plugins and browser extensions are globally installable and manageable via the package manager itself in systems like Debian. In these systems, this yields the benefit that when multiple users wish to use an extension, it only has to be installed once. In Apiary, it additionally provides the benefit that it can become part of the application container’s definition, making it available to the ephemeral container without requiring it to be manually installed by the user on each ephemeral execution. Similarly, one can create containers with functionality provided by other containers. A LATEX paper writing container can provide Emacs, LATEX and a PDF viewer. This PDF viewer is separate from the primary PDF container and its ephemeral instances. This demonstrates how application containers can be designed to deliver a specific functionality even when it overlaps with that of other parts of the system. 
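As a purely illustrative aside, the composition of such a container can be thought of as a short list of top-level layers plus a dependency closure over the shared layer repository; the sketch below uses invented layer names and dependency data, not Apiary's actual metadata format.

    # Hypothetical appliance descriptions: each names only its top-level layers;
    # everything else is pulled in from the shared layer repository.
    REPOSITORY_DEPS = {          # layer -> layers it depends on (invented excerpt)
        "xpdf": ["libpoppler", "libx11"],
        "texlive": ["libc6"],
        "emacs": ["libx11"],
        "libpoppler": ["libc6"],
        "libx11": ["libc6"],
        "libc6": [],
    }

    APPLIANCES = {
        "pdf-viewer": ["xpdf"],
        "latex-writing": ["emacs", "texlive", "xpdf"],  # bundles its own PDF viewer
    }

    def resolve(top_layers):
        """Return the full, de-duplicated layer set that backs a VLFS."""
        layers, stack = set(), list(top_layers)
        while stack:
            layer = stack.pop()
            if layer not in layers:
                layers.add(layer)
                stack.extend(REPOSITORY_DEPS[layer])
        return sorted(layers)

    print(resolve(APPLIANCES["latex-writing"]))
    # Shared layers (libc6, libx11, xpdf, ...) are stored once in the repository,
    # so adding the PDF viewer to the latex-writing appliance costs no extra space.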
A user would want to include the PDF viewer within the LATEX container, as it is a primary component of the paper-writing process, and not just a helper application to be isolated. But as this copy of Xpdf is not made into a global program, no application will call into this container. Because the layers are shared between containers, it costs nothing to include it in the LATEX container. If Xpdf were not in the LATEX container, users would have to go through multiple steps of copying the generated PDF files to the PDF container to view them, as papers are not generally kept in the /tmp directory.

7.3.2 Usage Study

We performed a usage study that evaluated the ability of users to use Apiary's containerized application model with our prototype environment, focusing on their ability to execute applications from within other programs. Participants were mostly recruited from within our local university, including faculty, staff and students. All of the users were experienced computer users, including many experienced Linux users. 24 participants took part in the study.

For our study, we created three distinct environments. The first was a plain Linux environment running the Xfce4 desktop. It provided a normal desktop Linux experience with a background of icons for files and programs and a full-fledged panel application with a menu, task switcher, clock and other assorted applets. Second was a full Apiary environment. It provided a much sparser experience, as the current Apiary prototype only provides a set of applications and not a full desktop environment. Finally, we supplied a neutered Apiary environment that differs from the full environment in not launching any child applications within ephemeral containers.

The three environments enable us to compare the participants' experience along two axes. First, we can compare the plain Linux environment, where each application is only installed once and always run from the same environment, to the neutered Apiary environment, where each application is also only installed once and run from the same environment. This allows us to measure the cost of using the Apiary viewer, with its built-in taskbar and application menu, against plain Linux, where the taskbar and application menu are regular applications within the environment. Second, the full and neutered Apiary desktops enable us to isolate the actual and perceived cost to the participants of instantiating ephemeral containers for application execution.

We presented the environments to the participants in random order and iterated through all 6 permutations equally. We timed the participants as they performed a number of specific multi-step tasks in each environment that were designed to measure the overhead of using multiple applications that needed to interact with one another. In summary, the tasks were: (1) download and view a PDF file with Firefox and Xpdf and follow a link embedded in the PDF back to the web; (2) read an email in Thunderbird that contains an attachment that is to be edited in OpenOffice and returned to the sender; (3) create a document in OpenOffice that contains text copied and pasted from the web and sent by email as a PDF file; (4) create a "Hello World" web page in OpenOffice and preview it in Firefox; and (5) launch a link received in the Pidgin IM client in Firefox.
As Figure 7.2 shows, the average time to complete each task, when averaged over all the users doing tasks in random order, only differed by a few seconds in any direction for all tasks in all environments.

[Figure 7.2 – Usage Study Task Times: time in seconds to complete Tasks 1–5 in the plain Linux, persistent, and ephemeral configurations]

Figure 7.2 shows that, in all cases, users performed their tasks quicker in the neutered Apiary environment than in the plain Linux environment. This indicates that Apiary's simpler environment is actually faster to use than the plain Linux environment with its bells and whistles like application launchers and applets running within taskbar panels. While this may seem strange initially, it is perfectly understandable. Many environments that are simple to use with minimal distractions, for example, the command line, are faster, but less user-friendly, than others. Moreover, even though users were a little slower in the full Apiary environment than in the neutered version, they were still generally faster than in the plain Linux environment. This indicates that while the full Apiary environment has a small amount of overhead, in practice, users are just as effective there as in the plain Linux environment.

We also asked the participants to rate their perceived ease of use of each environment. Most users perceived the prototype environments to be as easy to use as the plain Linux environment. While some users preferred the polish of the plain Linux environment, more preferred the simplicity of the environment provided by Apiary. Most users could not determine a difference between the full and neutered Apiary desktops.

We also asked the participants a number of questions, including whether they could imagine using the Apiary environment full-time, and whether they would prefer to do so if it would keep their desktop more secure. All of the participants expressed a willingness to use this environment full-time, and a large majority indicated that they would prefer to use Apiary over the plain Linux environment if it would keep their applications more secure. The majority of those who would not prefer Apiary expressed concern with bugs they perceived in the prototype. In addition, a few expressed interest in the system, but said that their preference would depend on the level of security they expected from the computer they were using.

7.3.3 Performance Measurements

7.3.3.1 Application Performance

To measure the performance overhead of Apiary on real applications, we compared the runtime performance of a number of applications within the Apiary environment against their performance in a traditional environment. Table 7.1 lists our application tests.

Table 7.1 – Application Benchmarks
  Untar:  Untar a compressed Linux 2.6.19 kernel source code archive
  Gzip:   Compress a 250 MB Linux kernel source tar archive
  Octave: Octave 3.0.1 (MATLAB 4 clone) running a numerical benchmark [68]
  Kernel: Build the 2.6.19 kernel

We focus mostly on file system benchmarks, as others have shown [27, 100] that display and operating system virtualization have little overhead. The untar tests file creation and throughput, while the gzip tests file system throughput and computation. The Octave benchmark is a pure computation benchmark.
The kernel build benchmark tests computation as well as stressing the file system, because of the large number of lookups that occur due to the large size of the kernel source tree and the repeated execution of the preprocessor, compiler and linker.

To stress the system with many containers and provide a conservative performance measure, each test was run in parallel with 25 instances. To avoid out-of-memory conditions, as the Octave benchmark requires 100-200 MB of memory at various points during its execution, we ran the benchmarks staggered 5 seconds apart to ensure they kept their high memory usage areas isolated and avoided the benchmark's being killed by Linux's out-of-memory handler.

[Figure 7.3 – Application Performance with 25 Containers: time in seconds for the Untar, Gzip, Octave and Kernel benchmarks under plain Linux and Apiary]

As is shown in Figure 7.3, Apiary imposes almost no overhead in most cases, with about 10% overhead in the kernel build case, where the VLFS's constant need to perform lookups on the file system incurs an extra cost. This demonstrates that Apiary is able to scale to a large number of concurrent containers with minimal overhead.

7.3.3.2 Container Creation

For ephemeral containers to be useful, container instantiation must be quick. We measured this cost in two ways: first, how long it takes to instantiate its VLFS, and second, how long the application takes to start up within the container. We quantify how long it takes to instantiate a container and compare Apiary to other common approaches. We compare how long it takes to set up a VLFS against how long it takes to set up a container file system using Debian's traditional bootstrapping tools (Create), how long it would take to extract the same file system from a tar archive (Extract), and how long it takes a file system with a snapshot operation to create a new snapshot and branch of a preexisting file system namespace (FS-Snap), as shown in Table 7.2. To minimize network effects with the bootstrapping tools, we used a Debian mirror on the local 100 Mbps campus network, and were able to saturate the connection while fetching the packages to be installed.

Table 7.2 – File System Instantiation Times
            Pidgin   Firefox  T-Bird   OOffice  Xpdf     MPlayer
  Create    317 s    276 s    294 s    365 s    291 s    294 s
  Extract   82 s     86 s     87 s     150 s    81 s     81 s
  FS-Snap   .016 s   .015 s   .016 s   .020 s   .009 s   .010 s
  Apiary    .005 s   .005 s   .005 s   .005 s   .005 s   .005 s

Table 7.2 shows that Apiary instantiates containers with a VLFS composed of nearly 200 layers nearly instantaneously. This compares very positively with traditional ways of setting up a system. Table 7.2 shows that it takes a significant amount of time to create a file system for the application container using Debian's bootstrapping tool, and even extracting a tar archive takes a significant amount of time as well. This discourages creating ephemeral application containers, as users will not want to wait minutes for their applications to start. Tar archives also suffer from their need to be actively maintained and rebuilt whenever they need fixes. Therefore, the amount of administrative work increases linearly with the number of applications in use. As Apiary creates the file system nearly instantaneously, it is able to support the creation of ephemeral application containers with no noticeable overhead to the users.
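The Create and Extract baselines in Table 7.2 can be approximated with a small timing harness along the following lines. This is a sketch of how such a comparison could be scripted, not the harness used for the measurements above; the mirror URL and tarball path are placeholders, and the Apiary column comes from instantiating the VLFS itself, for which stock tools have no equivalent.

    import subprocess, time

    def timed(label, cmd):
        """Time one provisioning command (wall clock). The commands below assume
        root on a Debian host; all target paths are illustrative."""
        start = time.time()
        subprocess.run(cmd, shell=True, check=True)
        print(f"{label}: {time.time() - start:.1f}s")

    # "Create": bootstrap a fresh Debian file system from a (placeholder) local mirror.
    timed("create",
          "debootstrap --arch=amd64 stable /tmp/fs-create http://mirror.example/debian")

    # "Extract": unpack a previously prepared tarball of the same file system.
    timed("extract",
          "mkdir -p /tmp/fs-extract && tar -xf /srv/appliance.tar -C /tmp/fs-extract")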
While Table 7.2 shows that file systems (in this case Btrfs) with a snapshot and branch operation can also perform it quickly, the user would have to manage each of the application's independent file systems separately.

To quantify startup time, we measured how long it takes for the application to open and then be automatically closed. In the case of Firefox, Xpdf and OpenOffice.org, this includes the time it takes to display the initial page of a document, while Pidgin, MPlayer and Thunderbird are only loading the program. For ephemeral containers, we measure the total time it takes to set up the container and execute the application within it. Ephemeral containers differ from persistent containers only in the time it takes to set up the new ephemeral container, which is never a cold-cache operation because the system is already in use. We compare these results to cold and warm cache application startup times for both plain Linux and Apiary's persistent containers. We include cold cache results for benchmarking purposes and warm cache results to demonstrate the results users would normally see.

[Figure 7.4 – Application Startup Time: cold (C) and warm (W) cache startup times in seconds for plain Linux and persistent containers, plus ephemeral containers, for each application]

As Figure 7.4 shows, while running within a container induced some overhead on startup, it is generally under 25% in both cold and warm cache scenarios. This overhead is mostly due to the added overhead of opening the many files needed by today's complex applications. The most complex application, OpenOffice, incurs the most overhead, while the least complex application, Xpdf, is almost equivalent to the plain Linux case. In addition, while the maximum absolute extra time spent in the cold cache case was nearly 5 seconds for OpenOffice, in the warm cache case it dropped to under 0.5 seconds.

In addition, ephemeral containers provide an interesting result. Even though they have a fresh new file system and would be thought to be equivalent to a cold cache startup, they are nearly equivalent to the warm cache case. This is because their underlying layers are already cached by the system due to their use by other containers. The ephemeral case has a slightly higher overhead due to the need to create the container and execute a display server inside of it in addition to regular application startup. However, as this takes under 10 milliseconds, it adds only a minimal amount to the ephemeral application startup time.

7.3.4 File System Efficiency

To support a large number of containers, Apiary must store and manage its file system efficiently. This means that storage space should not significantly increase with an increasing number of instantiated containers and should be easily manageable in terms of application updates. For each application's VLFS, Table 7.3 shows its size, its number of layers, the amount of state shared with the other application VLFSs, and the amount of state unique to it. For instance, the 129 layers that make up Firefox's VLFS require 353 MB, of which 330 MB are shared with other applications and 23 MB are unique to the Firefox VLFS. In general, as Table 7.3 shows, there is a lot of duplication among the containers, as the layer repository of 214 distinct layers needed to build the different VLFSs for the different applications is of the same magnitude as the largest application.
Table 7.3 – Apiary's VLFS Layer Storage Breakdown
            Size     # Layers  Shared   Unique
  Repo      743 MB
  Pidgin    394 MB   147       322 MB   72 MB
  Firefox   353 MB   129       330 MB   23 MB
  T-Bird    367 MB   125       335 MB   32 MB
  OOffice   645 MB   186       329 MB   316 MB
  Xpdf      339 MB   130       330 MB   9 MB
  MPlayer   355 MB   162       326 MB   29 MB

Table 7.4 – Comparing Apiary's Storage Requirements Against a Regular Desktop
  Single FS: 743 MB    Multiple FSs: 2.1 GB    VLFSs: 743 MB

Table 7.5 – Update Times for Apiary's VLFSs
  Avg. time, Traditional: 18 s    Avg. time, Apiary: 0.12 s

Table 7.4 shows that using individual VLFSs for each application container consumes approximately the same amount of file system space as a regular desktop file system containing all the applications, because each layer only has to be stored once. This is in comparison to the traditional method of provisioning multiple independent file systems for each application container, which consumes a significantly larger amount of disk space. Similarly, if multiple desktops are provided on a server, the VLFS usage would remain constant with the size of the repository, while the other cases would grow linearly with the number of desktops.

To demonstrate how Apiary allows users to maintain their many containers efficiently, we instantiated one container for each of the five applications previously mentioned. When a security update was necessary [146], we applied the update to each container. Table 7.5 shows the average times for the five application container file systems. This demonstrates that while individual updates by themselves do not take long, when there are multiple container file systems for each individual user, the amount of time to apply common updates will rise linearly, and as the traditional method is two orders of magnitude greater than Apiary, it will be impacted to a much greater extent.

7.3.5 File System Virtualization Overhead

To measure the virtualization cost of VLFS in the Apiary operating system virtualization environment, we re-ran the benchmarks from Chapter 6. These benchmarks differ from Chapter 6 in that they are not run within a hardware virtual machine, but rather within an operating system virtualization namespace, and that instead of the backing store of the VLFS being on a fast SAN device, they are on the slower host machine disks.

[Figure 7.5 – Postmark Overhead in Apiary: Postmark completion time in seconds for a plain desktop and for each application VLFS]

Figure 7.5 shows that Postmark runs faster within a plain Linux environment than when run within the VLFS. However, it should be noted that these results show significantly less overhead than those in Chapter 6. This is because even though the disks are slower, as indicated by the plain Linux results, the operating system virtualization overhead is minimal compared to the overhead imposed by the virtual machine monitor in Chapter 6. Most notable is a decrease in memory pressure which enables the VLFS to operate more efficiently because more data can remain cached.

[Figure 7.6 – Kernel Build Overhead in Apiary: kernel build time in seconds for a plain desktop and for each application VLFS]
Figure 7.6 shows similar results with the multi-threaded build of the Linux 2.6.18.6 kernel. In Chapter 6, the VLFS showed a 5% overhead; here, overhead is essentially zero. Even though the SAN’s file system, used for the tests in Chapter 6, is significantly faster than the blade’s file system, the results here are much faster overall. This again indicates the amount of overhead imposed by virtual machine monitors over operating system virtualization. Chapter 7. Apiary: A Desktop of Isolated Applications 7.4 169 Related Work Isolation mechanisms such as VMs [143, 147] and OS containers [7, 116] are commonly used to increase the security of applications. However, if used for desktop applications, this isolation prevents an integrated desktop experience. Products like VMware’s Unity [143] attempt to solve part of this issue by combining the applications from multiple VMs into a single display with a single menu and taskbar, as well as providing file system sharing between host and VMs. The applications, however, are still fully isolated from one another, preventing them from leveraging other applications installed into separate VMs. While VMs provide superior isolation, they suffer higher overhead due to running independent operating systems. This impacts performance and makes them less suited for ephemeral usage on account of their long startup times. However, Apiary can leverage them if one does not want to trust a single operating system kernel. Tahoma [122] is similar to Apiary in that it creates fully isolated application environments that remain part of a single desktop environment. Tahoma creates browser applications that are limited to certain resources, such as certain URLs, and that are fully isolated from each other. Tahoma is similar to Apiary in that it enables the creation of isolated application environments. However, it only provides these isolated application environments for web browsers. It does not provide any way to integrate these isolated environments and does not provide ephemeral application environments. Google’s Chrome web browser [66] builds upon some of these ideas to isolate web browser pages within a single browser. But the browser as a whole does not offer any isolation from the system. While its multiple-process model uses OS mechanisms to isolate separate web pages that are concurrently viewed, it does not provide any isolation from the system itself. For instance, any plugin that is executed Chapter 7. Apiary: A Desktop of Isolated Applications 170 has the same access to the underlying system as does the user running the browser. Modern web browsers improve privacy by providing private browsing modes that prevent browser state from being committed to disk. While they serve a similar purpose to ephemeral containers, private browsing is fundamentally different. First, it has to be written into the program itself. Many different types of programs have privacy modes to prevent them from recording state and this model requires them to implement it independently. Second, it only provides a basic level of privacy. For instance, it cannot prevent a plugin from writing state to disk. Furthermore, it makes the entire browser and any helper program or plugin that it executes part of the trusted computing base (TCB). This means that the user’s entire desktop becomes part of the TCB. If any of those elements gets exploited, no privacy guarantees can be enforced. 
Apiary’s ephemeral containers make the entire execution private and support any application with a state a user desires to remain private without any application modifications. It also keeps the TCB much smaller, by only requiring that the underlying OS kernel and the minimal environment of Apiary’s system daemon be trusted. Lampson’s Red/Green isolation [82] and WindowBox [23] resemble Apiary’s ability to run multiple applications in parallel. These isolation schemes involve users running two or more separate environments, for instance, a red environment for regular usage and a green environment for actions requiring a higher level of trust. However, unlike Apiary’s ephemeral containers, if an exploit can enter the green container, it will persist. Furthermore, by requiring two separate virtual machines, one increases the amount of work a user has to do to manage their machines. Apiary, by leveraging the VLFS, minimizes the overhead required required to manage multiple machines. Storage Capsules [33] also attempt to mitigate this problem by securely running the applications requiring trust in the same operating system environment as the un- Chapter 7. Apiary: A Desktop of Isolated Applications 171 trusted applications, while keeping their data isolated from one another. However, this involves significant startup and teardown costs for each execution within a secure storage capsule. File systems and block devices with branching or COW semantics [32,103,128] can be used to create a fresh file system namespace for a new container quickly. However, these file systems do not help to manage the large number of containers that exist within Apiary. Because each container has a unique file system with different sets of applications, administrators must create individual file systems tailored to each application. They cannot create a single template file system with all applications because applications can have conflicting dependency requirements or desire to use the same file system path locations. Furthermore, if all applications are in a single file system, they are not isolated from each other. This results in a set of space-inefficient file systems, as each file system has an independent copy of many common files. This inefficiency also makes management harder. When security holes are discovered and fixed, each individual file system must be updated independently. Many systems have been created that attempt to provide security through isolation mechanisms [17, 30, 49, 84, 86, 118, 144]. All these systems differ from Apiary in that they try to isolate the many different components that make up a standard fully-integrated single system using sets of rules to determine which of the machine’s resources the application should be able to access. This often results in one of two outcomes. First, a policy is created that is too strict and does not let the application run correctly. Second, a policy is created that is too lenient and lets an exploited application interact with data and applications it should not be able to access. Apiary, on the other hand, forces each component to be fully isolated within its own container before determining on which levels it should be integrated. As each container provides all the resources that the application needs to execute in an isolated environment, no Chapter 7. Apiary: A Desktop of Isolated Applications 172 complicated rule sets have to be created to determine what it can access. 
Solitude [72] provides isolation via its Isolation File System (IFS), which a user can throw away. This is similar to Apiary’s ephemeral containers. However, the IFSs are not fully isolated. First, Solitude does not create a new IFS for each application execution. Second, the IFS is built on top of a base file system with which it can share data, breaking the isolation. To handle this, Solitude implements taint tracking on files shared with the underlying base file system. This helps determine post facto what other applications may have been corrupted by a maliciously constructed file. Similarly, Solitude only provides isolation at the file system level. Because each application still shares a single display, malicious and exploited applications can leverage built-in mechanisms in commodity display architectures [61, 93] to insert events and messages into other applications sharing the display. Chapter 8 ISE-T: Two-Person Control Administration All organizations that rely on system administrators to manage their machines must prevent accidental and malicious administrative faults from entering their systems. As systems become more complex, it gets easier for administrators to make mistakes. From a security perspective, these complex systems create an environment where it is easier for a rogue user, whether an insider or outsider, to hide their attacks. For example, Robert Hanssen, an FBI agent who was a Soviet spy, was able to evade detection because he was the administrator of some of the FBI’s counterintelligence computer systems [149]. He could see whether the FBI had identified his drop sites and if he was being investigated [45]. Most approaches to insider attacks involve intrusion detection or role separation, both of which are ineffective against rogue system administrators who can replace the system module that enforces the separation or performs the intrusion detection. This attack vector was described over thirty years ago by Karger and Schell [101] and still remains a serious problem. Even if administrators can be trusted, they must deal Chapter 8. ISE-T: Two-Person Control Administration 174 with very complicated software, and it is hard to catch mistakes before they cause problems. If a mistake takes down an important service, the machine may not be usable or administratable, and malicious attackers can act with impunity. There are several ways to address faults, including partitioning, restore points and peer review. One highly effective approach is two-person control [13], for example, two pilots in an airplane, two keys for a safe deposit box, or running two or more computations in parallel and comparing the results. We believe this concept can be extended to problems in system administration by using virtualization to create duplicate environments. Toward this end, we created the “I See Everything Twice” [70] (ISE-T, pronounced “ice tea”) architecture. ISE-T provides a general mechanism to clone execution environments, independently execute computations to modify the clones, and compare how the resulting modified clones have diverged. The system can be used in a number of ways, such as performing the same task in two initially identical clones, or executing the same computation in the same way in clones with some differences. By providing clones, ISE-T creates a system where computation actions can be “seen twice,” applying the concept used for fault-tolerant computing to other forms of twoperson control systems. 
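As a toy rendering of this idea, the control flow is simply clone, modify independently, compare, and commit only on agreement. The names below are invented and a dictionary stands in for a file system; the real mechanisms, containers and layered file systems, are described later in this chapter.

    import copy

    def clone(base):
        # Stand-in for creating an isolated administration clone of the system.
        return copy.deepcopy(base)

    def changes(base, modified):
        # Stand-in for isolating the file system changes captured in a clone.
        return {path: data for path, data in modified.items() if base.get(path) != data}

    def administer(base, session_a, session_b):
        clone_a, clone_b = clone(base), clone(base)
        session_a(clone_a)                     # first administrator works alone
        session_b(clone_b)                     # second administrator works alone
        delta_a, delta_b = changes(base, clone_a), changes(base, clone_b)
        if delta_a == delta_b:                 # the computation was "seen twice"
            base.update(delta_a)               # commit either (equivalent) set
            return "committed"
        return f"discrepancy: {delta_a} vs {delta_b}"

    system = {"/etc/hostname": "old-name"}
    print(administer(system,
                     lambda fs: fs.update({"/etc/hostname": "new-name"}),
                     lambda fs: fs.update({"/etc/hostname": "new-name"})))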
There is, however, a crucial difference between our use of replicas and that of fault-tolerant computing. We test for equivalence between two replicas that may not be identical, rather than simply running two identical replicas in lockstep and ensuring they remain identical. By applying the ISE-T architecture to system administration, we are able to introduce the two-person control concept to system administration. As ISE-T allows a system to be easily cloned into multiple distinct execution domains, we can create separate cloned environments for multiple administrators. ISE-T can then compare the sets of changes produced by each administrator to determine if equivalent changes Chapter 8. ISE-T: Two-Person Control Administration 175 were made. ISE-T allows administration to proceed in both a fail-safe and an auditable manner. ISE-T forces administrative acts to be performed multiple times before they are considered correct. Current systems give full access to the machine to individual administrators. This means that one person can accidentally or maliciously break the system. ISE-T offers a new way to avoid this problem. ISE-T does not allow any administrator to modify the underlying system directly, but instead creates individual clones for two administrators to work on independently. ISE-T is then able to compare the changes each administrator performs. If the changes are equivalent, ISE-T has a high assurance that the changes are correct and will commit them to the base system. But if it detects discrepancies between the two sets of changes, it will notify the administrators so that they can resolve the problem. This enables fail-safe administration by catching accidental errors, while also preventing a single administrator from maliciously damaging the system. ISE-T leverages both virtualization and unioning file systems to produce the clones. ISE-T uses both operating system virtualization, as in Solaris Zones [116] and Linux VServer [7], and hardware virtualization as in VMware [142], to provide each administrator with an isolated environment. ISE-T builds upon DejaView [81] and Strata, using union file systems to yield a layered file system that provides the initial file system namespace in one layer, while capturing all the system administrator’s file system changes in a separate layer. This allows easy isolation of changes, simplifying equivalence testing. ISE-T’s requiring everything to be installed twice blocks many real attacks. A single malicious system administrator cannot create an intentional back door, weaken firewall rules, or create unauthorized user accounts. ISE-T is admittedly an expensive solution, too expensive for many commercial sites. For high-risk situations, such as in the financial, government, and military sec- Chapter 8. ISE-T: Two-Person Control Administration 176 tors, the added cost may be acceptable if risk is reduced. In fact, two-person controls are already routine in those environments, ranging from checks that require two signatures to requiring two people for nuclear weapons work. But we also demonstrate how ISE-T can be used in a less expensive manner by introducing a form of auditable system administration. Instead of requiring two system administrators at all times, ISE-T can save all the changes performed by the system administrator to a log, which is audited to provide a higher level of assurance that the administrator is behaving properly. In a similar manner, ISE-T can be extended to train less experienced system administrators. 
First, ISE-T allows a junior system administrator to perform tasks in parallel with a more senior system administrator. While only the senior administrator's solution will be committed to the system, the junior system administrator can learn from how their solution differs from the senior system administrator's. Second, ISE-T can be extended to provide an approval mode, in which a junior system administrator is given tasks to complete, but instead of being committed immediately, they will be presented for the senior system administrator to approve or disapprove.

8.1 Usage Model

Systems managed by ISE-T are used by two classes of users, privileged and unprivileged. ISE-T does not change how regular users interact with the machine. They are able to install any program into their personal space and run any program on the system, including regular programs and setuid UNIX programs, such as passwd, that raise the privileges of the process on execution.

However, ISE-T fundamentally changes the way system administrators interact with the machine. In regular systems, when administrators need to perform maintenance on the machine, they use their administrative privilege to run arbitrary programs, for example, by executing a shell or using sudo. In these systems, administrators can modify the system directly.

As ISE-T prevents system administrators from executing arbitrary programs with administrative privileges, the above model will not work with ISE-T. Instead, ISE-T provides a new approach as shown in Figure 8.1.

[Figure 8.1 – ISE-T Usage Model: two administrative clones feed into the ISE-T service, which commits changes to the underlying system]

Instead of administering a system directly, ISE-T creates administration clones. Each clone is fully isolated from others and from the base system. ISE-T instantiates a clone for each administrator. Once both administrators are finished making changes, ISE-T compares the clones for equivalence and commits the changes if they pass the test. As opposed to a regular system, where the administrator can interleave file system changes with program execution, in ISE-T only file system changes are committed to the underlying system.

Therefore ISE-T requires administrators to use other methods if they require file system changes and program execution to be interleaved on the actual system, such as for rotating log files or exploratory changes to diagnose a subtle system malfunction. To allow this, ISE-T provides a new ise-t command that is used in a manner similar to su. Instead of spawning a shell on the existing system, ise-t spawns a new isolated container for that administrator. This container contains a clone of the underlying file system. Within this clone, the administrators can perform generic administrative actions, as on a regular system, but the changes will be isolated to this new container. When the administrators are finished with their changes, they exit the new container's shell, much as they would exit a root shell; the container itself is terminated, while its file system persists. ISE-T then compares the changes each administrator performed for equivalence. ISE-T performs this task automatically after the second administrator exits their administration session and notifies both of the administrators of the results. If the changes are equivalent, ISE-T automatically commits the changes to the underlying base system.
Otherwise, ISE-T notifies the administrators of the file system discrepancies that exist between the two administration environments, allowing the administrators to correct them. Command ise-t new ise-t enter ise-t done ise-t diff Description Create an administration environment Enter administration environment Ready for equivalence testing Results of a failed equivalence test Table 8.1 – ISE-T Commands Because ISE-T only looks at file system changes, this can prevent it from performing administrative actions that affect only the runtime of the system. To address this, ISE-T provides a raw control mechanism via the file system, and allows itself Chapter 8. ISE-T: Two-Person Control Administration 179 to be integrated with configuration management systems. First, ISE-T’s raw control mechanism is implemented via a specialized file system namespace where an administrator can write commands. For instance, if the administrators want to kill a process, stop a service or reboot the machine, those actions performed directly within their administration container will have no effect on the base system. Some actions can be inferred directly from the file system. For instance, if the system’s set of startup programs is changed, ISE-T can infer that the service should be started, stopped or restarted when the changes are committed to the underlying system. But this approach only helps when the file system is being changed. Sometimes administrators want to stop or restart services without modifying the file system. ISE-T therefore provides a defined method for killing processes, stopping and starting services, and rebooting the machine using files stored on the local file system. ISE-T provides each administrator with a special /admin directory for performing these predefined administrative actions. For example, if the administrator wants to reboot the machine, they create an empty reboot file in the /admin directory. If both administrators create the file, the system will reboot itself after the other changes are committed. Similarly, the administrators can create a halt file to halt the machine. In addition, the /admin directory has kill and services subdirectories. To kill a process, administrators create individual files with the names of the process identifiers of processes running on the base system that they want to kill. Similarly, if a user desires to stop, start, or restart a init.d service, they create a file named by that service prefixed with stop, start, or restart, such as stop.apache or restart.apache within the services directory. ISE-T performs the appropriate actions when the changes are committed to the base system. The files created within the /admin directory are not committed to the base system; they are only used for performing runtime changes to the system. Chapter 8. ISE-T: Two-Person Control Administration 180 Many systems already exist to manage systems and perform these types of tasks, namely, configuration management systems such as lcfg [19]. At a high level, configuration management systems work by storing configuration information on a centralized policy server that controls a set of managed clients. In general, the policy server will contain a set of template configuration files that it uses to create the actual configuration file for the managed clients based on information contained in its own configuration. 
Configuration management systems also generally support the ability to run predefined programs and scripts and execute predefined actions on the clients they are managing. When ISE-T is integrated with any configuration management system, it no longer manages the individual machines. Instead of the managed clients being controlled by ISE-T, the configuration policy server is managed by ISE-T and the clients are managed by the configuration management system. This offers a number of benefits. First, it simplifies the comparison of two different systems, as ISE-T can focus on the single configuration language of the configuration management system. Second, configuration system already have tools to manage the runtime state of their client machines, such as stopping and starting services and restarting them when the configuration changes. Third, many organizations are already accustomed to using configuration management systems. By implementing ISE-T on the server side, they can enforce the two-person control model in a more centralized manner. 8.2 ISE-T Architecture To implement the two-person administrative control semantic, ISE-T provides three architectural components. First, because the two administrators cannot administer the system directly, they must be provided with isolated environments in which they Chapter 8. ISE-T: Two-Person Control Administration 181 can perform their administrative acts. To ensure isolation, ISE-T provides container mechanisms that allow ISE-T to create parallel environments based on the underlying system to be administered. This allows ISE-T to fully isolate each administrator’s clone environment from each other and from the base system. Second, we note that any persistent administrative action must involve a change to the file system. If the file system is not affected, the action will not survive a reboot. While some administrative acts only affect the ephemeral runtime state of the machine, the majority are more persistent. The file system is therefore a central component in ISE-T’s two-person administrative control. ISE-T provides a file system that can create branches of itself as well as isolate the changes made to it. This allows for easy creation of clone containers and comparison of the changes performed to both environments. Finally, ISE-T provides the ISE-T System Service. This service instantiates and manages the lifetimes of the administration environments. It is able to compare the two separate administration environments for equivalence to determine if the changes performed to them should be committed to the base system. ISE-T’s System Service performs this via an equivalence test that compares the two administration environment’s file system modifications for equivalence. If the two environments are equivalent, the changes will be committed to the underlying base system. Otherwise, the ISE-T System Service will notify the two administrators of the discrepancies and allow them to fix their environments appropriately. 8.2.1 Isolation Containers ISE-T can leverage multiple types of container environments depending on administrative needs. In general, the choice will be between hardware virtual machine Chapter 8. ISE-T: Two-Person Control Administration 182 containers and operating system containers. Hardware virtual machines such as VMware [142] provide a virtualized hardware platform with a separate operating system kernel, yielding a complete operating system instance. 
Operating system containers such as Solaris Zones [116], however, are just isolated kernel namespaces running on a single machine. For ISE-T, there are two main differences between these containers. First, hardware virtual machines allow the administrators to install and test new operating system kernels, as each container will be running its own kernel. Operating system containers, on the other hand, prevent the administrators from testing the underlying kernel, as there is only one kernel running, that of the underlying host machine. Second, as hardware virtual machines require their own kernel and a complete operating system instance, they make it time-consuming to create administration clones. Operating system containers, however, can be created almost instantly. As both types of containers have significant benefits for different types of administrative acts, ISE-T supports both. For most actions, administrators will prefer operating system containers, but they can still use a complete hardware virtual machine to test kernel changes. When ISE-T is integrated with a configuration management system, ISE-T does not have to use any isolation container mechanism at all, as the configuration management system already isolates the administrators from the client system. Instead, ISE-T simply provides each administrator with their own configuration management tree and lets both administrators perform the changes. Chapter 8. ISE-T: Two-Person Control Administration 8.2.2 183 ISE-T’s File System To support its file system needs, ISE-T leverages the branching ability of some file systems. Unlike a regular file system, a branchable file system can be snapshot at some point in time and branched for future use. This allows ISE-T to quickly clone the file system of the machine being managed. Because each file system branch is independent, ISE-T can capture any file system changes in the newly created branch by comparing the branch’s state to the initial file system’s state. Similarly, ISE-T can then compare the sets of file system changes from both administration clones to one another. Although a classical branchable file system allows changes to be captured, it does not make it possible to determine efficiently what has changed, because the branch is a complete file system namespace. Iterating through the complete file system can take a significant amount of time, place a large strain on the file system, and decrease system performance. Two features allow ISE-T to use a file system efficiently. First, it must be able to duplicate the file system to provide each administrator with their own independent file system on which to make changes. Second, it must allow easy isolation of each administrator’s changes to test them for equivalence. To meet these requirements, ISE-T creates layered file systems for each administration environment. Multiple file systems can be layered together into a single file system namespace for each environment. This enables each administration environment to have a layered file system composed of two layers, a single shared layer that is the file system of the machine they are administrating, as well as a layer containing all the changes the administrator makes on the file system. Chapter 8. ISE-T: Two-Person Control Administration 8.2.3 184 ISE-T System Service ISE-T’s System Service has a number of responsibilities. First, it manages the lifetimes of each administrator’s environment. When administration is required, it has to set up the environments quickly. 
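A minimal sketch of how such a two-layer clone can be assembled on a stock Linux kernel is shown below; it uses overlayfs as a stand-in for the UnionFS-based prototype, the paths are illustrative, and the mounts require root.

    import os, subprocess

    def make_admin_clone(name, base="/"):
        """Compose an administration clone: a read-only view of the base system
        plus a private writable layer that captures every change the
        administrator makes."""
        root = f"/srv/ise-t/{name}"
        upper, work, mnt = f"{root}/upper", f"{root}/work", f"{root}/mnt"
        for d in (upper, work, mnt):
            os.makedirs(d, exist_ok=True)
        subprocess.run(
            ["mount", "-t", "overlay", "overlay",
             "-o", f"lowerdir={base},upperdir={upper},workdir={work}", mnt],
            check=True)
        return upper   # the change layer: exactly the files this administrator touched

    layer_a = make_admin_clone("admin-a")
    layer_b = make_admin_clone("admin-b")
    # Equivalence testing only needs to walk layer_a and layer_b; the shared base
    # layer is never modified, so nothing outside these directories is examined.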
Similarly, when the administration session has been completed and the changes committed to the underlying system, it removes them from the system and frees up their space. Third, it evaluates the two environments for equivalence by running a number of equivalence tests to determine if the two administrators performed the same set of modifications. Finally, it has to either notify the administrators of the discrepancies between their two environments or commit the equivalent environment’s changes to the underlying base system. ISE-T’s layered file system allows the system service to easily determine which changes each administrator made, as each administrator’s changes are confined to their personal layer of the layered file system. To determine if the changes are equivalent, ISE-T first isolates the files that will not be committed to the base system, that is, the administrator’s personal files in their branch, such as shell history. Instead of merely removing them, ISE-T saves them for archival and audit purposes. ISE-T then iterates through the files in each environment, comparing the file system contents and files directly to one another. If each administrator’s branch has an equivalent set of file system changes, ISE-T can then simply commit a set to the base system. On the other hand, if the files contained within each branch are not equivalent, ISE-T flags the differences and reports them. The administrators then confer to ensure that they perform the same steps to create the same set of files to commit to the base system. Ways of determining equivalence can vary based on the type of file and what is considered to be equivalent in context. For instance, a configuration file modified by both administrators with different text editors can visually appear equivalent, Chapter 8. ISE-T: Two-Person Control Administration 185 but can differ if one uses spaces and another uses tabs. These files are equivalent insofar as applications parse them the same way, but are different on a character by character level. However, there are some languages (e.g., Python) where the amount of white space matters and can have a great effect on how the script executes. On the other hand, two files that have exactly the same file contents can have varying metadata associated with the file, such as permissions, extended attributes, or time data. Similarly, some sets of files need not be compared for equivalence, such as the shell history that records the steps the administrators take in their respective environments, and, in general, the home directory contents of the administrator in his administration environment. ISE-T removes these files from the comparison, and never commits them to the underlying system. Taking this into consideration, ISE-T’s prototype comparison algorithm determines these sets of differences. 1. Directory entries which do not exist in both sets of changes are different. 2. Directory entries with different UIDs, GIDs, or permission sets are different. 3. Directory entries of different file types (Regular File, Symbolic Link, Directory, Device Node, or Named Pipe) are different. For directory entries of the same type, ISE-T performs the appropriate comparison. • Device nodes must be of the same type. • Symbolic links must contain the exact same path. • Regular files must have the same size and the exact same contents. Chapter 8. ISE-T: Two-Person Control Administration 186 There are two major problems with this approach. First, this comparison takes place at a very low semantic level. 
It does not take into account simple differences between files that make no difference in practice. However, without writing a parser for each individual configuration language, one will not easily be able to compare equivalence. Second, there are certain files, such as encryption keys, that will never be generated identically, even though equivalent actions were taken to create them. This can be important, as some keys are known to be weaker and a malicious administrator can construct one by hand. Both of these problems can be solved by integrating ISE-T with a configuration management system and teaching ISE-T the configuration management system’s language. First, these systems simplify the comparison by enabling it to focus on the configuration management system’s language. Even though most configuration management systems work by creating template configuration files for the different applications, these files are not updated regularly and can be put through the stricter exact comparison test. On the other hand, when ISE-T understands the language of the configuration management system, it can rely on a more relaxed equivalence test. Second, configuration management systems already deal with dynamic files like encryption keys. A common way configuration management systems deal with these types of files is by creating them directly on the managed client machines. Because ISE-T understands the configuration management system’s language, the higher level semantics that instruct the system to create the file will be compared for equivalence instead of the files themselves. However, a potential weakness of ISE-T is in dealing with files that cannot easily be created on the fly and will differ between two system administration environments, such as databases. For instance, two identical database operations can result in different databases due to different timestamps or reordering of updates on the database server. Chapter 8. ISE-T: Two-Person Control Administration 8.3 187 ISE-T for Auditing Although the two-person control is useful for providing high assurance that faults are not going to creep into the system, its expense can make it impractical in many situations. For example, since the two-person control model requires the concurrence of two system administrators on all actions, it can prevent time-sensitive actions if only a single administrator is available. Similarly, while the two-person control model provides a very high degree of assurance for a price, it would be useful if organizations could get a somewhat higher degree of assurance at a lower price. To achieve these goals, we can combine ISE-T’s mechanisms with audit trail principles to create an auditable system administration semantic. In auditable system administration, every system administration act is logged to a secure location for review. The ISE-T System Service creates cloned administration environments for the two administrators and can capture the state they change in order to compare for equivalence. For auditable system administration, ISE-T’s mechanism can also be used. The audit system prevents the single system administrator from modifying the system directly, instead requiring the creation of a cloned administration environment where the administrator can perform the changes before they are committed to the underlying system. 
Instead of comparing for equivalence against a second system administrator, the changes are logged so that they can be examined at some time in the future, while being immediately committed to the underlying system.

Audit systems are known to increase assurance against malicious changes, as the would-be perpetrator knows there is a good chance their actions will be caught. Similarly, depending on the frequency and number of audits performed, auditing can help prevent administration faults from persisting in the system for long periods of time. However, it does not provide as much assurance as two-person control, because the administrator can use the fact that his changes are committed immediately to create back doors in the system that will not be discovered until later.

Auditable system administration needs to be tied directly to an issue-tracking service. This allows an auditor to associate an administrative action with its intended result. Every time an administrator invokes ISE-T to administer the system, an issue-tracking number is passed into the system to tie that action to the issue in the tracker. This allows the auditor to compare the actual results with what the auditor expects to have occurred. In addition, auditable system administration can be used in combination with the two-person control system when only a single administrator is available and immediate action is needed. With auditing, the action can be performed by the single administrator, but can be immediately audited when the second administrator becomes available.

8.4 Experimental Results

To test the efficacy of ISE-T's layered file system approach, we recruited 9 experienced computer users with varying levels of system administration experience, though all were familiar with managing their own machines. We provided each user with a VMware virtual machine running Debian GNU/Linux 3.0. Each VM was configured to create an ISE-T administration environment that would allow the users to perform multiple administration tasks isolated from the underlying base system. Our ISE-T prototype uses UnionFS [150] to provide the layered file system needed by ISE-T. We asked the users to perform the eleven administration tasks listed in Table 8.2. The user study was conducted in virtual machines running on an IBM HS20 eServer blade with dual 3.06 GHz Intel Xeon CPUs and 2.5 GB RAM running VMware Server 1.0. These tasks were picked to be representative of common administration tasks, and included a common way for a malicious administrator to create a back door in the system. Each task was performed in a separate ISE-T container, so that each administration task was isolated from the others, and none of the tasks depended on the results of a previous task.

Category                 Description                                  Result          Desired
Software Installation    Install official rdesktop package            Equivalent      Yes
                         Compile & install rdesktop from source       Equivalent      Yes
                         Install all pending security updates         Equivalent      Yes
System Services          Install SSH daemon from package              Not Equivalent  No
                         Remove PPP using package manager             Equivalent      Yes
Configuration Changes    Edit machine's persistent hostname           Equivalent      Yes
                         Edit the inetd.conf to enable a service      Not Equivalent  No
                         Add a daily run cron job                     Equivalent      Yes
                         Remove an hourly run cron job                Equivalent      Yes
                         Change the time of a cron job                Equivalent      Yes
Exploit                  Create a backdoor setuid root shell          Not Equivalent  Yes

Table 8.2 – Administration Tasks
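To make the comparison concrete before turning to the per-task results, the sketch below walks two administrators' branches and applies the directory-entry rules described earlier in this chapter: entries missing from one branch, differing ownership or permissions, differing file types, and type-specific content differences. It is only an approximation of the prototype; the function names and the prune list are illustrative, and the prototype's actual pruning and whiteout handling are described in the next section.

```python
#!/usr/bin/env python3
"""Illustrative sketch of an ISE-T-style branch comparison (not the actual prototype).

Walks two administrators' branches, skips trees that are never committed
(illustrative prune list), and reports entries that differ by existence,
ownership/permissions, file type, or type-specific content.
UnionFS whiteout files are not handled in this sketch.
"""
import filecmp
import os
import stat

PRUNE = {"root", "tmp"}  # illustrative; the prototype's full rules are given below

def entry_kind(st):
    """Map an lstat result to a coarse file type name."""
    for name, test in (("dir", stat.S_ISDIR), ("file", stat.S_ISREG),
                       ("symlink", stat.S_ISLNK), ("chardev", stat.S_ISCHR),
                       ("blockdev", stat.S_ISBLK), ("fifo", stat.S_ISFIFO)):
        if test(st.st_mode):
            return name
    return "other"

def collect(branch):
    """Return {relative path: lstat result} for every entry, minus pruned trees."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(branch):
        if os.path.relpath(dirpath, branch) == ".":
            dirnames[:] = [d for d in dirnames if d not in PRUNE]
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            entries[os.path.relpath(path, branch)] = os.lstat(path)
    return entries

def compare(branch_a, branch_b):
    """Yield (path, reason) for every difference between the two branches."""
    a, b = collect(branch_a), collect(branch_b)
    for path in sorted(set(a) ^ set(b)):
        yield path, "exists in only one branch"
    for path in sorted(set(a) & set(b)):
        sa, sb = a[path], b[path]
        if (sa.st_uid, sa.st_gid, stat.S_IMODE(sa.st_mode)) != \
           (sb.st_uid, sb.st_gid, stat.S_IMODE(sb.st_mode)):
            yield path, "ownership or permissions differ"
        elif entry_kind(sa) != entry_kind(sb):
            yield path, "file types differ"
        elif stat.S_ISLNK(sa.st_mode):
            if os.readlink(os.path.join(branch_a, path)) != \
               os.readlink(os.path.join(branch_b, path)):
                yield path, "symbolic link targets differ"
        elif stat.S_ISREG(sa.st_mode):
            if sa.st_size != sb.st_size or not filecmp.cmp(
                    os.path.join(branch_a, path),
                    os.path.join(branch_b, path), shallow=False):
                yield path, "regular file contents differ"

if __name__ == "__main__":
    import sys
    for path, reason in compare(sys.argv[1], sys.argv[2]):
        print(f"{path}: {reason}")
```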
We used ISE-T to capture the changes each user performed for each task in its own file system. We were then able to compare each user against the others for each of the eleven tasks to see where their modifications differed. For every test, ISE-T prunes the changes to remove files that would not affect equivalence, as described in Chapter 8.2.3. Notably, in our prototype, ISE-T prunes the /root directory, which is the home directory of the root user, and therefore would contain differences in files such as .bash history, among others, that are particular to each user’s approach to the task. Similarly, ISE-T prunes the /var subtree to remove any files that are not equivalent. For instance, depending on what tools an administrator uses, different files are created. A cache of packages might be downloaded and installed via the apt-get tool instead of manually. The reasoning behind this pruning is that the /var tree is meant as a read-write file system for Chapter 8. ISE-T: Two-Person Control Administration 190 per-system usage. Tools will modify it; if different tools are used, different changes will be made. The entire directory tree cannot be pruned, however, because there are files or directories within it that are necessary for runtime use and those changes have to be committed to the underlying file system. Therefore, only those changes that are equivalent are committed, while those that are different were ignored. ISE-T also prunes the /tmp directory, as the contents of this directory would also not be committed to the underlying disk. Finally, due to the UnionFS implementation used for these experiments, ISE-T also prunes the whiteout files created by UnionFS if there is no equivalent file on the underlying file system. In many cases, temporary files with random names will be created; when they are deleted, UnionFS will create a whiteout file, even if there is no underlying file to whiteout. As this whiteout file does not have an impact on the underlying file system, it is ignored. On the other hand, whiteout files that do correspond to underlying files and therefore indicate that the file was deleted are not ignored. 8.4.1 Software Installation In the software installation category, we had the users perform three separate tests to demonstrate that when multiple users install the same piece of software, as long as they install it in the same general way, the two installations will be equivalent. To demonstrate this, the users were first instructed to install the rdesktop program from its Debian package. Users had multiple ways of installing the package, including downloading and installing it by hand via dpkg, using apt-get to download it and any unfulfilled dependencies, as well as using the aptitude front end to apt-get. Most users decided to install the package via apt-get, but even those who did not made equivalent changes. The only differences were in pruned directories, demon- Chapter 8. ISE-T: Two-Person Control Administration 191 strating that installing a piece of pre-packaged software using regular tools results in an equivalent system. Second, the users were instructed to build the rdesktop program from source code and install it into the system. In this case, multiple differences could have occurred. First, if the compiler were to create a different binary each time the source code is compiled, even without any changes, it would be difficult to check for equivalence. 
Second, programs generally can be installed in different areas of the file system, such as /usr versus /usr/local. In this case, all the testers decided to install the program into the default location, avoiding the latter problem, while also demonstrating that as long as a the same source code is compiled by the same tool chain, it will result in the same binary. However, some program source code, such as the Linux kernel, will dynamically modify its source during build, for example to define when the program was built. In these cases, we would expect equivalence testing to be more difficult, as each build will result in a different binary. A simple solution would be to patch the source code to avoid this behavior. A more complicated solution would involve evaluating the produced binary’s code and text sections with the ability to determine that certain text section modifications are inconsequential. Again, in this case, the only differences were in pruned directories, notably the /root home directory, to which the users downloaded the source for rdesktop. Finally, we instructed the users to install all the pending security updates. This is more complicated than the first test, as many packages were upgraded. Although differences existed between the environments of the users, the differences were confined to the /var file system tree and depended on how they performed the upgrade. This is because Debian provides multiple ways to do an upgrade of a complete system and those cause different log files to be written. As they all installed the same set of packages, the rest of the file system, as expected, contained no differences. Chapter 8. ISE-T: Two-Person Control Administration 8.4.2 192 System Services Our second set of tests involved adding and removing services. Users were instructed to install SSH and remove PPP. These tests were an extension of the previous package installation tests and demonstrated how one would automatically start and stop services, as well as a demonstration of files we knew would fail equivalence testing. For the first test, we instructed the users to install the SSH daemon. This test sought to demonstrate that ISE-T can detect when a new service is installed and therefore enable it when the changes are committed. In Linux systems, a System-V init script has to be added to the system to allow it to be started each time the machine boots. If the user’s administration environment contains a new init script, ISE-T automatically determines that the service should be started when this set of administration changes is committed to the base system. This test also sought to demonstrate that certain files are always going to be different between users if created within their private environment. The SSH host key for each environment is different because it is created based on the kernel’s random entropy pool, which is different for each user and therefore will never be the same if created in a separate environment. A way around this would be not to create it within the private branch of each user, but instead to create it after the equivalent changes are committed, for example, the first time the service’s init script is executed. For the second test, we instructed the users to remove the PPP daemon. This test demonstrated that there are multiple ways to remove a package in a Debian system, and depending on the way the package is removed, the underlying file system will be different. Specifically, a package can either be removed or purged. 
When a package is removed, files marked as configuration files are left behind, allowing the packages to be reinstalled and have the configuration remain the same. On the other hand, Chapter 8. ISE-T: Two-Person Control Administration 193 when a package is purged, the package manager will remove the package and all the configuration files associated with it. In this case, the users chose different ways to remove the package, and ISE-T was able to determine the differences for those that chose to remove or purge it. 8.4.3 Configuration Changes Our third set of tests involved modifications to configuration files on the system and included five separate tests in two categories. The first category was composed of simple file configuration changes. We first instructed the users to modify the host name of the machine persistently from debian to iset, which is accomplished by editing the /etc/hostname file. As expected, as this configuration change is very simple. All users modified the system’s hostname in the exact same manner, allowing ISE-T to determine that all the systems were equivalent. Next, we instructed the users to modify the /etc/inetd.conf to enable the discard service. In this case, as the file is more free-form, their changes were not exact, and many were not equivalent. For example, some users enabled it for both TCP and UDP, while others enabled it for TCP alone. Also, some added a comment, while others did not. Whereas the first change is not equivalent, the second change should be considered equivalent, but this cannot be determined by a simple diff. One needs to parse the files correctly to determine that they are equivalent, an ability lacking in our ISE-T prototype. However, ISE-T was able to clearly report the differences that existed between the users environments. The second category of tests involved setting up and removing cron jobs and was composed of three tests. First, we provided the users with a script in /root that we instructed them to install in the system in a manner so that it will be executed Chapter 8. ISE-T: Two-Person Control Administration 194 daily. In Debian there are two ways to have a cron job execute daily. The best way is to simply place the script in the /etc/cron.daily directory, as all the programs located within it are executed once a day. On the other hand, a user could use the /etc/crontab file and instruct it exactly what time each day to execute the program. We were looking for users to place the file in /etc/cron.daily which the large majority did; but one user who perhaps was not used to Debian modified /etc/crontab directly. Again, ISE-T was able to detect the difference when it existed and report the discrepancy. Similarly, we instructed users to remove a cron job that ran hourly. The cron job’s script was already located in /etc/cron.hourly and not in crontab. All users deleted it from the /etc/cron.hourly directory, even the user who modified /etc/crontab in the previous test. Finally, we had an entry in /etc/crontab that executed a cron job at 4:35 AM every day, and instructed the users to change it to 5:35 AM. As expected, all users performed the same set of changes by modifying the /etc/crontab file. 8.4.4 Exploit In our final test, we asked the users to create a back door shell into the system. This test demonstrates that ISE-T will detect when one of the two system administrators makes a rogue administration change to the system, as the other administrator will not have made that same administration change. 
We did not specify how or where the shell should be created, and left it up to the user. The users created setuid shells in many different locations, some as “dot files” hidden from a normal ls listing, and some just performed setuid on existing shells. In all cases, as none of the users made this rogue modification in the same manner, ISE-T was able to show the rogue changes as different in every environment. Chapter 8. ISE-T: Two-Person Control Administration 8.5 195 Related Work Two-person control mechanisms are known to provide high levels of assurance [134]. Multiple examples exist with regard to nuclear weapons. For instance, to launch a nuclear weapon, two operators must separately confirm that launch orders are valid and must turn their launch keys together to launch the missiles. In fact, every sensitive action concerning nuclear weapons must be performed by two people with the same training and authority [39, Chapter 2]. The same notion is applied in many financial settings: banks will require two people to be involved in certain tasks, such as opening a safe deposit box [148], and companies can require two people to sign a check [55] over a certain amount. This makes it much more difficult for a single person to commit fraud. As far as we know, this mechanism has never been applied directly to system administration. In the Compartmented Mode Workstation (CMW), the system administration job is split into roles, so that many traditional administration actions require more than one user’s involvement [138]. This demarcation of roles was first pioneered in Multics at MIT [75]. Similarly, the Clark-Wilson model was designed to prevent unauthorized and improper modifications to a system to ensure its integrity [44]. All these systems simply divided the administrators’ actions among different users who performed different actions. This differs fundamentally from the traditional notion of two-person control where both people do the same exact action. More recently, many products have been created to help prevent and detect when accidental mistakes occur in a system. SudoSH [69] is able to provide a higher level of assurance during system administration as it records all keystrokes entered during a session and is able to replay the session. However, while sudosh can provide an audit log of what the administrator did, it does not provide the assurances provided Chapter 8. ISE-T: Two-Person Control Administration 196 by the two-person control model. Even if one were to audit the record or replay it, one is not guaranteed to get the same result. Although auditing this record can be useful for detecting accidental mistakes, it cannot detect malicious changes. For instance, a file fetched from the Internet can be modified. If the administrators can control which files are fetched, they can manipulate them before and after the sudosh session. ISE-T, on the other hand, does not care about the steps administrators take to accomplish a task, only the end result as it appears on the file system. Part of the reason accidental mistakes occur is that knowledge is not easily passed between experienced and inexperienced system administrators. Although systems like administration diaries and wikis can help, they do not easily associate specific administration actions with specific problems. Trackle [50] attempts to solve this by combining an issue tracker with a logged console session. 
Issues can be annotated, edited and cross-referenced while the logged console session logs all actions taken and file changes and stores them with the issue, improving institutional memory. Although this allows less experienced system administrators to see the exact steps a previous administrator took to fix a similar or equivalent issue, it does not actually prevent mistakes from entering and remaining in the system, nor does it prevent a malicious administrator from operating. ISE-T’s notion of file system views was first explored in Plan 9 [104]. In Plan 9, it is a fundamental part of the system’s operation. As Plan 9 does not view manipulating the file system view as a privileged operation, each process can craft the namespace view it or its children will see. A more restricted notion of file system views is described by Ioannidis [71]. There, its purpose is to overlay a different set of permissions on an existing file system. Finally, a common way to make a system tolerant of administration faults is to use file system versioning, which allows rolling back to a configuration file’s previous Chapter 8. ISE-T: Two-Person Control Administration 197 state if an error is made. Operating systems such as Tops-20 [53] and VMS [90] include native operating system support for versioning as a standard feature of their file systems. These operating systems employ a copy-on-write semantic that involves versioning a file each time a process changes it. Other file systems, such as VersionFS [96], ElephantFS [127] and CVFS [132], have been created to provide better control of the file system versioning semantic. Chapter 9 Conclusions and Future Work This dissertation demonstrates that many different types of modern computing problems can be solved in a relatively simple manner with different forms of operating system virtualization. First, we presented *Pod. *Pod decouples a user’s computing experience from a single machine while providing them with the same persistent, personalized computing session they expect from a regular computer. *Pod allows different types of applications to be stored on a small portable storage device that can be easily carried on a key chain or in a user’s pocket, thereby allowing the user increased mobility. *Pod uses operating system and display virtualization to decouple the computing session from the host on which it is currently running. It combines this virtualization mechanism with a checkpoint/restart system that lets *Pod users suspend their computing session, move around, and resume their session at any computer. Second, we presented AutoPod. AutoPod expands on *Pod by enabling isolated applications running within a pod to be transparently migrated across machines running different operating system kernel versions. This lets maintenance occur promptly, as system administrators do not have to take down all applications running on a ma- Chapter 9. Conclusions and Future Work 199 chine when it needs maintenance. Instead, the applications are migrated to a new machine where they can continue execution. As AutoPod enables this across different kernel versions, security patches can be applied to operating systems in a timely manner with minimal impact on the availability of application services. Third, we presented PeaPod, an operating system virtualization layer that enables secure isolation of legacy applications. The virtualization layer leverages pods and introduces peas to encapsulating processes. 
Pods provide an easy-to-use lightweight virtual machine abstraction that can securely isolate individual applications without the need to run an operating system instance in the pod. Peas provide a fine-grained least-privilege mechanism that can further isolate application components within pods. PeaPod's virtualization layer can isolate untrusted applications, preventing them from being used to attack the underlying host system or other applications even if they are compromised.

Fourth, we presented Strata, which improves the way system administrators manage the virtual appliances (VAs) under their control by introducing the virtual layered file system. By addressing contents by file location instead of block address, VLFSs allow Strata to quickly and simply provision VAs, as no data needs to be copied into place. Strata provides improved management, as file system modifications are isolated and upgrades can be stored centrally and applied atomically. It also allows Strata to create new VLFSs and VAs by composing together smaller base VLFSs and VAs that provide core components. Strata significantly reduces the amount of disk space required for multiple VAs, allows them to be provisioned almost instantaneously, and allows them to be quickly updated no matter how many are in use. The research into Strata's VLFS also enabled DejaView's ability to provide a time-traveling desktop [81]. By layering a blank layer over the file system snapshot, DejaView was able to quickly recreate a fully writable file system view.

Fifth, we presented Apiary, which introduces a new compartmentalized application desktop paradigm. Instead of running one's applications in a single environment with complex rules to isolate the applications from each other, Apiary allows them to be easily and completely isolated while retaining the integrated feel users expect from their desktop computer. The key innovations that make this possible are the use of virtual layered file systems and the ephemeral application execution environments they enable. The VLFS allows the multiple containers to be stored as efficiently as a single regular desktop, while also allowing containers to be instantiated almost instantly. This functionality enables the creation of the ephemeral containers that provide an always fresh and clean environment for applications to run in.

Apiary's usage model of fully isolating each application works well in many scenarios, but can cause complications in others. For instance, as each application's file system is fully isolated, if one wanted to send a file as an email attachment, one could not create a new email message and attach the file to it; the email program might not have access to the file system containing the file. Although Apiary provides a method for users to copy files between containers, this can have an impact on users' ability to use the system efficiently. Applying Apiary's principles to non-desktop environments, such as smartphones and tablets, where user interface paradigms are not as ingrained as on the desktop, can enable user interface metaphors that behave seamlessly without compromising Apiary's application isolation.

Apiary also raises a number of interesting follow-up questions, as it only explores the benefits of applications that can run in total isolation. There are smaller applications, such as browser plugins, that cannot run in total isolation, but must remain part of a larger environment.
An interesting follow-up question would be to try to see how Apiary’s concepts apply to multiple components of a single application, where the components cannot be run independently. Chapter 9. Conclusions and Future Work 201 The ephemeral execution model introduced by Apiary provides multiple avenues for follow-up. For instance, many network-facing services, such as mail and web services, continuously run based on untrusted input they receive from the network. These services have also been consistently exploited due to flaws in their programs. However, the ephemeral execution model, as presented by Apiary, is not a perfect fit for these services as they need some level of “write” access to the underlying system that will be persistent. An interesting area of research would be to understand how these services operate and how ephemeral execution could be leveraged to provide more security while still allowing the persistent data storage that these services require. Finally, we presented ISE-T, which enables and applies the two-person controller model to system administration. In administration, this model requires two administrators to perform the same administrative act with equivalent results for the administrative changes to be allowed to affect the system that is being modified. ISE-T creates multiple parallel environments for the administrators to perform their administrative changes independently. ISE-T then compares the results of the administrative changes for equivalence. When the results are equivalent, there is a high assurance that system administration faults have not been introduced into the system, be they malicious or accidental in nature. ISE-T’s application of the two-person controller model is just an element of a larger vision of applying this dual control model to solving computing problems. In particular, we want to explore how the ability to create dual environments can provide improved systems management and security of systems in general. For system management, patching a system is critical to ensure that it remains secure. However, many patches can introduce new bugs as well. By being able to create two environments that run in parallel, one can test the known working system against a patched Chapter 9. Conclusions and Future Work 202 system to ensure that the patch does not introduce any new faults. Similarly, it can improve security as we can create two parallel environments that differ randomly in areas such as their process’s address space layout and stacks. As code injection attacks are directly tied to these layouts, by running two systems in parallel with different layouts, an attack will result in fundamentally different results on the two systems, allowing one to detect that an attack is occurring. Bibliography [1] Fakeroot. http://fakeroot.alioth.debian.org/. [2] Gmail. https://gmail.google.com. [3] Google Docs. https://docs.google.com. [4] he RPM Package Manager. http://www.rpm.org/. [5] Hotmail. http://www.hotmail.com. [6] Linux Containers. http://lxc.sourceforge.net/. [7] Linux VServer Project. http://www.linux-vserver.org/. [8] Portable Firefox. http://johnhaller.com/jh/mozilla/portable_firefox/. [9] SoX - Sound eXchange. http://sox.sourceforge.net. [10] Stealth Surfer. http://www.stealthsurfer.biz/. [11] Trek Thumbdrive TOUCH. http://www.thumbdrive.com/p-thumbdrive. php?product=tdswipecrypto. [12] U3 Platform. http://www.u3.com. 
Bibliography 204 [13] US DoD Joint Publication 1-02, DOD Dictionary of Military and Associated Terms (as amended through 9 June 2004). [14] Virtual Network Computing. http://www.realvnc.com/. [15] Sendmail v.5 Vulnerability. Technical Report CA-1995-08, CERT Coordination Center, August 1995. [16] MIME Conversion Buffer Overflow in Sendmail Versions 8.8.3 and 8.8.4. Technical Report CA-1997-05, CERT Coordination Center, January 1997. [17] Anurag Acharya and Mandar Raje. MAPbox: Using Parameterized Behavior Classes to Confine Applications. In The 9th USENIX Security Symposium, Denver, CO, August 2000. [18] Adobe Systems Incorporated. Buffer Overflow Issue in Versions 9.0 and Earlier of Adobe Reader and Acrobat. http://www.adobe.com/support/security/ advisories/apsa09-01.html, February 2009. [19] Paul Anderson. LCFG: A Practical Tool for System Configuration. Usenix Association, August 2008. [20] http://www.aim.com/get_aim/express/. [21] Myla Archer, Elizabeth Leonard, and Matteo Pradella. Towards a Methodology and Tool for the Analysis of Security-Enhanced Linux. Technical Report NRL/MR/5540—02-8629, NRL, August 2002. [22] Yeshayahu Artsy, Hung-Yang Chang, and Raphael Finkel. Interprocess Communication in Charlotte. IEEE Software, 4(1):22–28, January 1987. Bibliography 205 [23] Dirk Balfanz and Daniel R. Simon. WindowBox: A Simple Security Model for the Connected Desktop. In The 4th USENIX Windows Systems Symposium, Seattle, WA, August 2000. [24] Amnon Barak and Richard Wheeler. MOSIX: An Integrated Multiprocessor UNIX. In The 1989 USENIX Winter Technical Conference, pages 101–112, San Diego, CA, February 1989. [25] Arash Baratloo, Navjot Singh, and Timothy Tsai. Transparent Run-Time Defense Against Stack Smashing Attacks. In The 2000 USENIX Annual Technical Conference, San Diego, CA, June 2000. [26] Ricardo Baratto, Shaya Potter, Gong Su, and Jason Nieh. MobiDesk: Mobile Virtual Desktop Computing. In The 10th Annual ACM International Conference on Mobile Computing and Networking, Philadelphia, PA, September 2004. [27] Ricardo A. Baratto, Leonard N. Kim, and Jason Nieh. THINC: A Virtual Display Architecture for Thin-Client Computing. In The 20th ACM Symposium on Operating Systems Principles, Brighton, United Kingdom, October 2005. [28] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauery, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization. In The 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, October 2003. [29] Andrew Baumann, Jonathan Appavoo, Dilma Da Silva, Jeremy Kerr, Orran Krieger, and Robert W. Wisniewski. Providing Dynamic Update in an Operating System. In 2005 USENIX Annual Technical Conference, pages 279–291, Anaheim, CA, April 2005. Bibliography 206 [30] Andrew Berman, Virgil Bourassa, and Erik Selberg. TRON: Process-specific File Protection for the UNIX Operating System. In The 1995 USENIX Winter Technical Conference, pages 165–175, New Orleans, LA, January 1995. [31] bitdefender. Trojan.pws.chromeinject.b. http://www.bitdefender.com/ VIRUS-1000451-en--Trojan.PWS.ChromeInject.B.html, November 2008. [32] Jeff Bonwick and Bill Moore. ZFS: The Last Word In File Systems. http:// opensolaris.org/os/community/zfs/docs/zfs_last.pdf, November 2005. [33] Kevin Borders, Eric Vander Weele, Billy Lau, and Atul Prakash. Protecting Confidential Data on Personal Computers with Storage Capsules. In The 18th USENIX Security Symposium, Montreal. Canada, August 2009. [34] Ed Bugnion, Scott Devine, and Mendel Rosenblum. 
Disco: Running Commodity Operating Systems on Scalable Multiprocessors. In The 16th ACM Symposium on Operating Systems Principles, pages 143–156, Saint Malo, France, December 1997. [35] Thomas Bushnell. The HURD: Towards a New Strategy of OS Design. http: //www.gnu.org/software/hurd/hurd-paper.html, 1994. [36] Bruce Byfield. An Apt-Get Primer. http://www.linux.com/articles/40745, December 2004. [37] Ramón Cáceres, Casey Carter, Chandra Narayanaswami, and Mandayam Raghunath. Reincarnating PCs with Portable SoulPads. In The 3rd International Conference on Mobile Systems, Applications, and Services, pages 65–78, Seattle, WA, June 2005. ACM. Bibliography 207 [38] Justin Capps, Scott Baker, Jeremy Plichta, Duy Nyugen, Jason Hardies, Matt Borgard, Jeffry Johnston, and John H. Hartman. Stork: Package Management for Distributed VM Environments. In The 21st Large Installation System Administration Conference, Dallas, TX, November 2007. [39] Ashton B. Carter, John D. Steinbruner, and Charles A. Zraket, editors. Managing Nuclear Operations. The Brookings Institution, Washington, DC, 1987. [40] Jeremy Casas, Dan Clark, Rabi Konuru, Steve Otto, Robert Prouty, and Jonathan Walpole. MPVM: A Migration Transparent Version of PVM. Computing Systems, 8(2):171–216, 1995. [41] Ramesh Chandra, Nickolai Zeldovich, Constantine Sapuntzakis, and Monica S. Lam. The Collective: A Cache-Based System Management Architecture. In The 2nd Symposium on Networked Systems Design and Implementation, pages 259–272, Boston, MA, April 2005. [42] David R. Cheriton. The V Distributed System. Communications of the ACM, 31(3):314–333, March 1988. [43] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live Migration of Virtual Machines. In The 2nd Symposium on Networked Systems Design and Implementation, pages 273–286, Boston, MA, April 2005. [44] David D. Clark and David R. Wilson. A Comparison of Commercial and Military Computer Security Policies. IEEE Symposium on Security and Privacy, 0:184, April 1987. Bibliography 208 [45] Commission for Review of FBI Security Programs, William Webster, chair. Webste Report: A Review of FBI Security Programs, March 2002. [46] Small Form Factors Committee. Specification for Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.). Technical Report SFF-8035, Technical Committee T13 AT Attachment, April 1996. [47] Roberto Di Cosmo, Berke Durak, Xavier Leroy, Fabio Mancinelli, and Jérôme Vouillon. Maintaining Large Software Distributions: New Challenges from the FOSS Era. EASST Newsletter, 12:7–20, 2006. [48] Crispan Cowan, Calton Pu, Dave Maier, Jonathan Walpole, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, Qian Zhang, and Heather Hinton. StackGuard: Automatic Adaptive Detection and Prevention of Buffer-Overflow Attacks. In The 7th USENIX Security Conference, pages 63–78, San Antonio, TX, January 1998. [49] Crispin Cowan, Steve Beattie, Greg Kroah-Hartman, Calton Pu, Perry Wagle, and Virgil Gligor. SubDomain: Parsimonious Server Security. In 14th USENIX Systems Administration Conference, New Orleans, LA, December 2000. [50] Daniel S. Crosta, Matthew J. Singleton, and Benjamin A. Kuperman. Fighting Institutional Memory Loss: The Trackle Integrated Issue and Solution Tracking System. In The 20th Large Installation System Administration Conference, pages 287–298, Washington, DC, December 2006. [51] B.C. Cumberland, G. Carius, and A. Muir. 
Microsoft Windows NT Server 4.0, Terminal Server Edition: Technical Reference. Microsoft Press, Redmond, WA, August 1999. Bibliography 209 [52] Martin Davis and Hilary Putnam. A Computing Procedure for Quantification Theory. Journal of the ACM, 7(3):201–215, July 1960. [53] Digital Equipment Corporation. TOPS-20 User’s Guide, January 1980. [54] Fred Douglis and John Ousterhout. Transparent Process Migration: Design Alternatives and the Sprite Implementation. Software - Practice and Experience, 21(8):757–785, August 1991. [55] Michael Sack Elmaleh. Nonprofit Fraud Prevention. http://www. understand-accounting.net/Nonprofitfraudprevention.html, 2007. [56] Javier Fernandez-Sanguino. Debian GNU/Linux FAQ - Chapter 8 - The Debian Package Management Tools. http://www.debian.org/doc/FAQ/ ch-pkgtools.en.html. [57] FreeBSD Project. Developer’s Handbook. http://www.freebsd.org/doc/en_ US.ISO8859-1/books/developers-handbook/secure-chroot.html. [58] Steve Friedl. Best Practices for UNIX chroot() Operations. http://unixwiz. net/techtips/chroot-practices.html, January 2002. [59] Tal Garfinkel. Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools. In The 10th Annual Network and Distributed Systems Security Symposium, San Diego, CA, February 2003. [60] Tal Garfinkel, Ben Pfaff, and Mendel Rosenblum. Ostia: A Delegating Architecture for Secure System Call Interposition. In The 1st Network and Distributed Systems Security Symposium, February 2004. [61] James Gettys and Robert W. Scheifler. Xlib - C Language X Interface. X Consortium, Inc., 1996. p. 224. Bibliography 210 [62] Martyn Gilmore. 10Day CERT Advisory on PDF Files. http://seclists. org/fulldisclosure/2003/Jun/0463.html, June 2003. [63] Gnome.org. Libwnck Reference Manual. http://library.gnome.org/devel/ libwnck/. [64] GOBBLES Security. Local/Remote Mpg123 Exploit. http://www.opennet. ru/base/exploits/1042565884_668.txt.html, January 2003. [65] L. Gong and R. Schemers. Implementing Protection Domains in the Java Development Kit 1.2. In The 1998 Internet Society Symposium on Network and Distributed System Security, pages 125–134, San Diego, CA, 1998. [66] Google. Google Chrome - Features. http://www.google.com/chrome/intl/ en/features.html. [67] GreyMagic Security Research. Reading Local Files in Netscape 6 and Mozilla. http://sec.greymagic.com/adv/gm001-ns/, April 2002. [68] Philippe Grosjean. Speed Comparison of Various Number Crunching Packages (Version 2). http://www.sciviews.org/benchmark/, March 2003. [69] Douglas Hanks. Sudosh. http://sourceforge.net/projects/sudosh/. [70] Joseph Heller. Catch-22. Simon and Schuster, 1961. [71] Sotiris Ioannidis, Steven M. Bellovin, and Jonathan Smith. Sub-Operating Systems: A New Approach to Application Security. In SIGOPS European Workshop, Saint-Emilion, France, September 2002. Bibliography 211 [72] Shvetank Jain, Fareha Shafique, Vladan Djeric, and Ashvin Goel. Applicationlevel Isolation and Recovery with Solitude. In The 3rd ACM European Conference on Computer Systems, pages 95–107, Glasgow, Scotland, April 2008. [73] Michael K. Johnson. Linux Kernel Hackers’ Guide. The Linux Documentation Project, 1997. [74] Poul-Henning Kamp and Robert N. M. Watson. Jails: Confining the Omnipotent Root. In The 2nd International SANE Conference, MECC, Maastricht, The Netherlands, May 2000. [75] Paul Karger. Personal Communication, May 2009. [76] Jeffrey Katcher. PostMark: A New File System Benchmark. Technical Report TR3022, Network Appliance, Inc., October 1997. [77] Jeffry O. 
Kephart and David M. Chess. The Vision of Autonomic Computing. IEEE Computer, pages 41–50, January 2003. [78] Yousef A. Khalidi and Michael N. Nelson. Extensible File Systems in Spring. In The 14th ACM Symposium on Operating Systems Principles, pages 1–14, Asheville, NC, December 1993. ACM. [79] Gene Kim and Eugene Spafford. Experience with Tripwire: Using Integrity Checkers for Intrusion Detection. In The 1994 System Administration, Networking, and Security Conference, April 1994. [80] Calvin Ko, Timothy Fraser, Lee Badger, and Douglas Kilpatrick. Detecting and Countering System Intrusions Using Software Wrappers. In The 9th USENIX Security Symposium, Denver, CO, August 2000. Bibliography 212 [81] Oren Laadan, Ricardo Baratto, Dan Phung, Shaya Potter, and Jason Nieh. DejaView: A Personal Virtual Computer Recorder. In The 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007. [82] Butler Lampson. Accountability and Freedom. http://research.microsoft. com/en-us/um/people/blampson/slides/accountabilityandfreedom.ppt, September 2005. [83] Jeffrey P. Lanza and Shawn V. Hernan. Remote Buffer Overflow in Sendmail. Technical Report CA-2003-07, CERT Coordination Center, March 2003. [84] Zhenkai Liang, V.N. Venkatakrishnan, and R. Sekar. Isolated Program Execution: An Application Transparent Approach for Executing Untrusted Programs. In 19th Annual Computer Security Applications Conference, Las Vegas, NV, December 2003. [85] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. Technical Report 1346, University of Wisconsin Madison Computer Sciences, April 1997. [86] Peter Loscocco and Stephen Smalley. Integrating Flexible Support for Security Policies into the Linux Operating System. In The FREENIX Track: 2001 USENIX Annual Technical Conference, Boston, MA, June 2001. [87] David E. Lowell, Yasushi Saito, and Eileen J. Samberg. Devirtualizable Virtual Machines Enabling General, Single-node, Online Maintenance. In The 11th International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, October 2004. Bibliography 213 [88] Art Manion, Shawn V. Hernan, and Jeffery P. Lanza. Buffer overflow in sendmail. Technical Report CA-2003-12, CERT Coordination Center, March 2003. [89] David Mazières. A Toolkit for User-Level File Systems. In The 2001 USENIX Annual Technical Conference, pages 261–274, Boston, MA, June 2001. [90] Kirby McCoy. VMS File System Internals. Digital Press, 1990. [91] Mark McLoughlin. QCOW2 Image Format. http://www.gnome.org/~markmc/ qcow-image-format.htm, September 2008. [92] Microsoft. Microsoft Application Virtualization. http://www.microsoft.com/ systemcenter/appv/default.mspx. [93] Microsoft Corp. SendMessage Function. http://msdn.microsoft.com/en-us/ library/ms644950(VS.85).aspx. [94] Moka5. Moka5 Technology Overview. http://www.moka5.com/node/381, November 2006. [95] Sape J. Mullender, Guido Van Rossum, Andrew S. Tanenbaum, Robert van Renesse, and Hans Van Staveren. Amoeba: A Distributed Operating System for the 1990s. IEEE Computer, 23(5):44–53, May 1990. [96] Kiran-Kumar Muniswamy-Reddy, Charles P. Wright, Andrew Himmer, and Erez Zadok. A Versatile and User-Oriented Versioning File System. In The 3rd USENIX Conference on File and Storage Technologies, pages 115–128, San Francisco, CA, March/April 2004. [97] Rajeev Nagar. Filter Drivers. In Windows NT File System Internals: A Developer’s Guide. 
O’Reilly, September 1997. Bibliography 214 [98] Gustavo Niemeyer. Smart Package Manager. http://labix.org/smart. [99] Peter Norton, Peter Aitken, and Richard Wilton. The Peter Norton PC Programmer’s Bible: The Ultimate Reference to the IBM PC and Compatible Hardware and Systems Software. Microsoft Press, 1993. [100] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The Design and Implementation of Zap: A System for Migrating Computing Environments. In The 5th Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002. [101] Paul A. Karger and Roger R. Schell. Multics Security Evaluation: Vulnerability Analysis, Volume II. Technical Report ESD-TR-74-193, HQ Electronic Systems Division: Hanscom AFB, MA, June 1974. [102] Jan-Simon Pendry and Marshall Kirk McKusick. Union Mounts in 4.4BSD-lite. In The 1995 USENIX Technical Conference, New Orleans, LA, January 1995. [103] Ben Pfaff, Tal Garfinkel, and Mendel Rosenblum. Virtualization Aware File Systems: Getting Beyond the Limitations of Virtual Disks. In 3rd Symposium of Networked Systems Design and Implementation, San Jose, CA, May 2006. [104] Rob Pike, David L. Presotto, Ken Thompson, and Howard Trickey. Plan 9 from Bell Labs. In The 1990 Summer UKUUG Conference, pages 1–9, London, United Kingdom, July 1990. UKUUG. [105] Rob Pike and Dennis M. Ritchie. The Styx Architecture for Distributed Systems. Bell Labs Technical Journal, 4(2):146–152, 1999 1999. Bibliography 215 [106] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent Checkpointing under Unix. In The 1995 USENIX Winter Technical Conference, pages 213–223, New Orleans, LA, January 1995. [107] Thomas Porter and Tom Duff. Compositing Digital Images. Computer Graphics, 18(3):253–259, July 1984. [108] Jef Poskanzer. http://www.acme.com/software/http_load/. [109] Shaya Potter, Ricardo Baratto, Oren Laadan, Leonard Kim, and Jason Nieh. MediaPod: A Personalized Multimedia Desktop In Your Pocket. In The 11th IEEE International Symposium on Multimedia, pages 219–226, San Diego, CA, December 2009. [110] Shaya Potter, Ricardo Baratto, Oren Laadan, and Jason Nieh. GamePod: Persistent Gaming Sessions on Pocketable Storage Devices. In The 3rd International Conference on Mobile Ubiquitous Computing, Systems, Services, and Technologies, Sliema, Malta, October 2009. [111] Shaya Potter, Steven M. Bellovin, and Jason Nieh. Two Person Controller Administration: Preventing Administrative Faults through Duplication. In The 23rd Large Installation System Administration Conference, Baltimore, MD, November 2009. [112] Shaya Potter and Jason Nieh. Reducing downtime due to system maintenance and upgrades. In The 19th Large Installation System Administration Conference, pages 47–62, San Diego, CA, December 2005. Bibliography 216 [113] Shaya Potter and Jason Nieh. WebPod: Persistent Web Browsing Sessions with Pocketable Storage Devices. In The 14th International World Wide Web Conference, Chiba, Japan, May 2005. [114] Shaya Potter and Jason Nieh. Highly Reliable Mobile Desktop Computing in Your Pocket. In The 2006 IEEE Computer Society Signature Conference on Software Technology and Applications, September 2006. [115] Shaya Potter, Jason Nieh, and Matt Selsky. Secure Isolation of Untrusted Legacy Applications. In The 21st conference on Large Installation System Administration Conference, pages 117–130, Dallas, TX, November 2007. [116] Daniel Price and Andrew Tucker. Solaris Zones: Operating System Support for Consolidating Commercial Workloads. 
In 18th Large Installation System Administration Conference, November 2004. [117] Debian Project. DDP Developers’ Manuals. http://www.debian.org/doc/ devel-manuals. [118] Niels Provos. Improving Host Security with System Call Policies. In The 12th USENIX Security Symposium, Washington, DC, August 2003. [119] Jim Pruyne and Miron Livny. Managing Checkpoints for Parallel Programs. In The 2nd Workshop on Job Scheduling Strategies for Parallel Processing, Honolulu, HI, April 1996. [120] Richard F. Rashid and George G. Robertson. Accent: A Communication Oriented Network Operating System Kernel. In The 8th ACM Symposium on Operating System Principles, pages 64–75, Bretton Woods, NH, December 1984. Bibliography 217 [121] Darrell Reimer, Arun Thomas, Glenn Ammons, Todd Mummert, Bowen Alpern, and Vasanth Bala. Opening Black Boxes: Using Semantic Information to Combat Virtual Machine Image Sprawl. In The 2008 ACM International Conference on Virtual Execution Environments, Seattle, WA, March 2008. [122] Charles Reis and Steven D. Gribble. Isolating Web Programs in Modern Browser Architectures. In The 4th ACM European Conference on Computer Systems, Nuremberg, Germany, March 2009. [123] Eric Rescorla. Security Holes... Who Cares? In The 12th USENIX Security Conference, Washington, D.C., August 2003. [124] David Rosenthal. Evolving the Vnode Interface. In The 1990 USENIX Summer Technical Conference, pages 107–118, June 1990. [125] Marc Rozier, Vadim Abrossimov, François Armand, I. Boule, Michel Gien, Marc Guillemont, F. Herrman, Claude Kaiser, S. Langlois, P. Léonard, and W. Neuhauser. Overview of the Chorus Distributed Operating System. In The Workshop on Micro-Kernels and Other Kernel Architectures, pages 39–70, Seattle, WA, 1992. [126] Jerome H. Saltzer and Michael D. Schroeder. The Protection of Information in Computer Systems. In The 4th ACM Symposium on Operating System Principles, Yorktown Heights, NY, October 1973. [127] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding When to Forget in the Elephant File System. In The 17th ACM Symposium on Operating Systems Principles, Charleston, SC, December 1999. Bibliography 218 [128] Constantine P. Sapuntzakis, Ramesh Chandra, Ben Pfaff, Jim Chow, Monica S. Lam, and Mendel Rosenblum. Optimizing the Migration of Virtual Computers. In The 5th Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002. [129] Brian K. Schmidt. Supporting Ubiquitous Computing with Stateless Consoles and Computation Caches. PhD thesis, Computer Science Department, Stanford University, August 2000. [130] Glenn C. Skinner and Thomas K. Wong. ”Stacking” Vnodes: A Progress Report. In The 1993 USENIX Summer Technical Conference, pages 1–27, Cincinnati, Ohio, June 1993. [131] Peter Smith and Norman C. Hutchinson. Heterogeneous Process Migration: The Tui System. Software – Practice and Experience, 28(6):611–639, 1998. [132] Craig A. N. Soules, Garth R. Goodson, John D. Strunk, and Gregory R. Ganger. Metadata Efficiency in a Comprehensive Versioning File System. In The 2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, March 2003. [133] Ray Spencer, Stephen Smalley, Peter Loscocco, Mike Hibler, David Andersen, and Jay Lepreau. The Flask Security Architecture: System Support for Diverse Security Policies. In The 8th USENIX Security Symposium, Washington, DC, August 1999. [134] Peter Stein and Peter Feaver. Assuring Control of Nuclear Weapons. 
University Press of America, 1987. Bibliography 219 [135] Sun Microsystems, Inc. NFS: Network File System Protocol Specification. Technical Report RFC 1094, Internet Engineering Task Force, March 1989. [136] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Reliability of Commodity Operating Systems. In The 19th ACM Symposium on Operating Systems Principles, pages 207–222, Bolton Landing, NY, USA, October 2003. ACM Press. [137] Miklos Szeredi. Filesystem in Userspace. http://fuse.sourceforge.net/. [138] Johnny S. Tolliver. Compartmented Mode Workstation (CMW) Comparisons. In 17th DOE Computer Security Group Training Conference, Milwaukee, WI, May 1995. [139] Anthony Towns. Checking Installability is an NP-Complete Prob- lem. http://www.mail-archive.com/[email protected]/ msg03311.html, November 2007. [140] Satoshi Uchino. MetaVNC - A Window Aware VNC. http://metavnc. sourceforge.net/. [141] Inc. VMWare. VMware VMotion for Live Migration of Virtual Machines. http: //www.vmware.com/products/vi/vc/vmotion.html. [142] VMware, Inc. http://www.vmware.com. [143] VMware Inc. VMware Worksation 6.5 Release Notes. http://www.vmware. com/support/ws65/doc/releasenotes_ws65.html, October 2008. [144] David Wagner. Janus: An Approach for Confinement of Untrusted Applications. Master’s thesis, University of California, Berkeley, 1999. Bibliography 220 [145] Robert N. M. Watson. Exploiting Concurrency Vulnerabilities in System Call Wrappers. In The 1st USENIX Workshop on Offensive Technologies, Boston, MA, August 2007. [146] Florian Weimer. DSA-1438-1 Tar – Several Vulnerabilities. http://www.ua. debian.org/security/2007/dsa-1438, December 2007. [147] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. Scale and Performance in the Denali Isolation Kernel. In The 5th Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002. [148] Wilshire State Bank. Safe Deposit Boxes. https://www.wilshirebank.com/ public/additional_safedeposit.asp, 2008. [149] David Wise. Spy: The Inside Story of how the FBI’s Robert Hanssen Betrayed America. Random House, 2002. [150] Charles P. Wright, Jay Dave, Puja Gupta, Harikesavan Krishnan, David P. Quigley, Erez Zadok, and Mohammad Nayyer Zubair. Versatility and Unix Semantics in Namespace Unification. ACM Transactions on Storage, 2(1):1–32, February 2006. [151] X/Open, editor. Protocols for X/Open PC Interworking: SMB, Version 2. X/Open Company Ltd, 1992. [152] Erez Zadok and Jason Nieh. FiST: A Language for Stackable File Systems. In The 2000 USENIX Annual Technical Conference, pages 55–70, San Diego, CA, June 2000. Appendix A Restricted System Calls To securely isolate regular Linux processes, we interpose on a number of additional system calls beyond what is necessary for other forms of virtualization. Below is a complete list of the few system calls that require more than plain virtualization on Linux. We give the reasoning for the interposition, where it is not self-explanatory, and note what functionality was changed from the base system call. Most system calls do not require more than simple virtualization to ensure isolation because virtualization of the resources itself isolates them. For example, the kill system call cannot signal a process outside the virtualized environment because the virtualized namespace will not map it, so the system call cannot reference the process. A.1 Host-Only System Calls These system calls are generally not needed in a virtualized environment and are therefore not allowed. 1. 
mount – If a user within a virtualized environment were able to mount a file system, they could mount a file system with device nodes already present and would thus be able to access the underlying system directly in a manner not controlled by the virtualization architecture. Any file systems that need to be mounted within the virtualized environment must be mounted by the host.

2. stime, adjtimex, settimeofday – Allow a privileged process to adjust the host's clock.

5. acct – Sets the file on the host to which BSD process accounting information should be written.

6. swapon, swapoff – Control swap space allocation.

8. reboot – Causes the system to reboot or changes the Ctrl-Alt-Delete functionality.

9. ioperm, iopl – Allow a privileged process to gain direct access to underlying hardware resources.

11. create_module, init_module, delete_module, query_module – Insert and remove kernel modules.

15. nfsservctl – Enables a privileged process inside a virtual environment to change the host's internal NFS server.

16. bdflush – Controls the kernel's buffer-dirty-flush daemon.

17. sysctl – A deprecated system call that enables runtime setting of kernel parameters.

18. clock_settime – Sets the realtime clock and is only usable by processes with privilege on a regular system.

A.2 Root-Squashed System Calls

These system calls are, in general, useful within a virtualized environment, but treat the privileged root user in a manner that breaks the virtualization abstraction. They can, however, be used without giving the root user any special privilege.

1. nice, setpriority, sched_setscheduler, sched_setparam – These system calls let a process change its priority. If a process is running as root (UID 0), it can increase its priority and freeze out other processes on the system. Therefore, we prevent any virtualized process from increasing its priority.

5. ioctl – This system call is a system call demultiplexer that allows kernel device drivers and subsystems to add their own functions that can be called from user space. But because functionality can be exposed that allows root to access the underlying host, all calls beyond a limited, audited safe set are squashed to user nobody, much as NFS does.

6. setrlimit – This system call allows processes running as UID 0 to raise their resource limits beyond what was preset, thereby allowing them to disrupt other processes on the system by using too many resources. We therefore prevent virtualized processes from using this system call to increase the resources available to them.

7. mlock, mlockall – These system calls allow a privileged process to pin an arbitrary amount of memory, thereby allowing a virtualized process to lock all available memory and starve all other processes on the host. We therefore squash a privileged process to user nobody when it attempts to call these system calls and treat it like an unprivileged process.

A.3 Option-Checked System Calls

These are system calls that are used within a virtualized environment, but can be used in a way that can break the virtualization. Therefore, the options passed to them are checked to ensure they are valid options for the virtualized environment.

1. mknod – This system call allows a privileged user to create special files, such as pipes, sockets, devices, and even regular files.
Because a privileged process needs to use this functionality, the system call cannot be disabled. However, if the process could create a device, the device would be an access point to the underlying host system. Therefore, when a virtualized process uses this system call, the options are checked to prevent it from creating a device special file, while allowing the other types; a small illustrative sketch of this check appears at the end of this appendix.

2. ustat – This system call returns information about a mounted file system, specifically how much free space remains. This can be useful for a process within a virtualized environment, but it can also provide information about host file systems that are not accessible to the processes within the virtualized environment. Therefore, the options passed to this system call are checked to ensure that they match the device of a file system available only within the virtualized environment.

3. quotactl – This system call sets a limit on the amount of space individual users can use on a given file system. Virtualized processes are only able to call it for file systems available within their environment.

A.4 Per-Virtual-Environment System Calls

These system calls are virtualized on top of the IPC, shared memory, and process namespace virtualization provided by Zap [100].

1. sethostname, gethostname, setdomainname, getdomainname, uname, newuname, olduname – These system calls read and write the name of the underlying host. We wrap these system calls to read and write a virtual-environment-specific name and allow each virtual environment to set the name independently.

8. socketcall – This system call provides access to the multitude of socket system calls available in the kernel. Because a secure virtualized environment provides each environment with its own network namespace, this system call is restricted to operating only on the namespace that belongs to the virtualized environment.

9. keyctl, add_key, request_key – These system calls affect the key management provided by the kernel. Because keys can be associated with user and group identifiers, they must be virtualized to a per-virtualized-environment namespace.

12. mq_open, mq_unlink, mq_timedsend, mq_timedreceive, mq_notify, mq_getsetattr – These system calls provide access to the kernel's POSIX message queues. Because they are used by name, they have to be virtualized on a per-environment basis.
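As a concrete illustration of the option checks in Section A.3, the following sketch shows the policy applied to mknod: named pipes, sockets, and regular files are permitted, while device special files are rejected. The real check is performed inside the kernel interposition layer; this user-space version, including the function name, is purely illustrative.

```python
#!/usr/bin/env python3
"""Illustrative sketch of the mknod option check described in Section A.3.

The actual check runs in the kernel's system call interposition layer; this
user-space version only demonstrates the policy: allow FIFOs, sockets, and
regular files, but reject device special files inside a virtualized environment.
"""
import stat

def mknod_allowed(mode: int) -> bool:
    """Return True if a virtualized process may create a node with this mode."""
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return False  # device nodes would expose the underlying host
    return stat.S_ISFIFO(mode) or stat.S_ISSOCK(mode) or stat.S_ISREG(mode)

if __name__ == "__main__":
    print(mknod_allowed(stat.S_IFIFO | 0o644))  # True: named pipes are allowed
    print(mknod_allowed(stat.S_IFCHR | 0o600))  # False: character devices are blocked
```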