The Virtual Faraday Cage
UNIVERSITY OF CALGARY

The Virtual Faraday Cage

by

James King

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

CALGARY, ALBERTA

AUGUST, 2013

© James King 2013

Abstract

This thesis' primary contribution is a new architecture for web application platforms and their extensions, entitled "The Virtual Faraday Cage". This new architecture addresses some of the privacy- and security-related problems associated with third-party extensions running within web application platforms. A proof-of-concept showing how the Virtual Faraday Cage could be implemented is described. This new architecture aims to help solve some of the key security and privacy concerns for end-users in web applications by creating a mechanism by which a third-party could create an extension that works with end-user data, but which could never leak such information back to the third-party. To facilitate this, the thesis also incorporates a basic privacy-aware access control mechanism. This architecture could be used for centralized web application platforms (such as Facebook) as well as decentralized platforms. Ideally, the Virtual Faraday Cage should be incorporated into the development of new web application platforms, but could also be implemented via wrappers around existing application platform Application Programming Interfaces with minimal changes to existing platform code or workflows.

Acknowledgments

I would first like to thank my supervisors Ken Barker and Jalal Kawash. Without their guidance and mentorship, and their patience and support, I would not have completed my program or produced this work. I have the utmost respect for both Dr. Barker and Dr. Kawash as professors and supervisors, and I believe that anyone would be fortunate to have their instruction and guidance. While I did not complete my degree under them, special mention is deserved for my original supervisors, Rei Safavi-Naini and John Aycock, who both gave me the initial opportunity to come and study at the University of Calgary, and also the flexibility to change my area of research afterwards. I'd also like to thank my committee members, Dr. Gregory Hagen and Dr. Payman Mohassel, as well as my neutral chair, Dr. Peter Høyer. Both Dr. Hagen and Dr. Mohassel were very approachable during the final leg of my journey, and I appreciated their examination of my work. Dr. Hagen's feedback regarding Canadian privacy law was especially welcome, and I am happy to have expanded my thesis to address that specifically. More generally, I'd like to thank the University of Calgary's Department of Computer Science: their other faculty members, their IT staff, the department chair Dr. Carey Williams, as well as their office staff. Acknowledgments are also deserved for all the support and training I received at Florida Atlantic University, and especially their Department of Mathematics and Center for Cryptology and Information Security. Without the numerous people there that helped shape and prepare me for graduate school, I would have never come to the University of Calgary or pursued the path that I took. In particular, exceptional thanks should be reserved for Dr. Rainer Steinwandt, Dr. Ronald Mullin, Dr. Spyros Magliveras, and Dr. Michal Šramka. It's impossible for me to name everyone who has helped me along, but final thanks should go to all my friends and family who have given me their support during my studies.
Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Figures
1 Introduction
1.1 Premise
1.2 Organization of this Thesis
1.3 Background & Motivations
1.3.1 Web Applications
1.3.2 Online Social Networks as a Specific Web Application Platform
1.4 Privacy
1.4.1 Defining and Describing Privacy
1.4.2 Laws, Business, and the Value of Privacy
1.5 Social Networks
1.5.1 The Value of Social Network Data
1.5.2 Innate Risks, Threats, and Concerns
1.6 Security
1.6.1 Access Control and Information Flow Control
1.6.2 Sandboxing
1.7 Summary
2 Related Work
2.1 Overview
2.2 Software and Web Applications
2.2.1 P3P and Privacy Policies
2.2.2 Better Developer Tools
2.2.3 Empowering the End-User
2.3 Social Networks
2.3.1 Hiding End-User Data
2.3.2 Third-Party Extensions
2.4 Browser Extensions
2.5 Summary
3 Theoretical Model
3.1 Basics
3.2 Formal Model
3.2.1 Foundations
3.2.2 Information leakage
3.3 Summary
4 Architecture
4.1 Preamble
4.2 Features
4.2.1 Data URIs
4.2.2 Hashed IDs and Opaque IDs
4.2.3 Callbacks
4.2.4 Seamless Remote Procedure Calls and Interface Reconstruction
4.3 Information Flow Control
4.4 URIs
4.4.1 Domains
4.4.2 Paths
4.5 Application Programming Interfaces
4.5.1 Web Application Platform API
4.5.2 Third-Party Extension API
4.5.3 Shared Methods
4.5.4 Relationship with the Theoretical Model
4.6 High-Level Protocol
4.6.1 Accessing a Third-Party Extension
4.6.2 Mutual Authentication
4.6.3 Privacy by Proxy
4.7 Remote Procedure Calls
4.7.1 Overview
4.7.2 Protocol Requirements
4.7.3 Requirement Fulfillment
4.7.4 Protocol
4.7.5 Messages
4.7.6 Serialized Data
4.7.7 Responses
4.7.8 Security
4.8 Sandboxing
4.9 Inter-extension Communication
4.10 Methodology and Proof-of-Concept
4.10.1 Methodology
4.10.2 Development
4.10.3 Proof-of-Concept
4.10.4 Formal Model
4.10.5 Example Third-Party
4.10.6 Facebook Wrapper
4.11 Effects and Examples
4.11.1 A more connected web
4.11.2 Examples
4.12 Summary
5 Analysis & Conclusion
5.1 Comparisons and Contrast
5.1.1 PIPEDA Compliance
5.1.2 Comparisons with Other Works
5.2 Time & Space Complexity
5.2.1 Hashed IDs
5.2.2 Opaque IDs
5.2.3 Views
5.2.4 Access Control
5.2.5 Subscriptions
5.2.6 Sandboxing
5.2.7 Protocol
5.2.8 Summary
5.3 Shortcomings
5.3.1 Personal Information Protection and Electronic Documents Act
5.3.2 Inter-extension Communication
5.3.3 Proof-of-Concept
5.3.4 Hash Functions
5.3.5 High-Level Protocol
5.3.6 Time & Space Complexity Analysis
5.4 Future Work
5.4.1 Purpose
5.4.2 Enhanced Support for Legal Compliance
5.4.3 Callbacks
5.4.4 Inter-extension Communication
5.4.5 Time & Space Complexity and Benchmarking
5.4.6 URI Ontology
5.5 Summary
Bibliography

List of Figures

3.1 A web application platform.
3.2 An example of the generalization hierarchy for data s = ⟨"December", 14, 1974⟩
4.1 Internal and external realms of a platform.
4.2 Comparison of traditional extension and Virtual Faraday Cage extension architectures. Lines and arrows indicate information flow, dotted lines are implicit.
4.3 The Virtual Faraday Cage modeled using Decentralized Information Flow Control.
4.4 Information flow within the Virtual Faraday Cage. Dotted lines indicate the possibility of flow.
4.5 Process and steps for answering a read request from a principal. Start and end positions have bold borders, 'no' paths in decisions are dotted.
4.6 Process and steps for answering a write request from a principal. Start and end positions have bold borders, 'no' paths in decisions are dotted.
4.7 Process and steps for answering a create request from a principal. Start and end positions have bold borders, 'no' paths in decisions are dotted.
4.8 Process and steps for answering a delete request from a principal. Start and end positions have bold borders, 'no' paths in decisions are dotted.
4.9 Process and steps for answering a subscribe request from a principal. Start and end positions have bold borders, 'no' paths in decisions are dotted.
4.10 Process and steps for notifying subscribed principals when data has been altered. Start and end positions have bold borders, 'no' paths in decisions are dotted.
4.11 Process and steps for answering an unsubscribe from a principal. Start and end positions have bold borders, 'no' paths in decisions are dotted.
4.12 Steps required for authorizing and accessing a third-party extension.
4.13 The EMDB extension specifications
4.14 Authenticating an incoming connection from a VFC platform
4.15 Authenticating an incoming connection from a third-party extension
4.16 The Lightweight XML RPC Protocol
4.17 Hypothetical prompt and example extension asking for permission to share end-user data.
4.18 Creating a datastore and some data-items in an interactive Python environment
4.19 Applying projections on data-items in an interactive Python environment
4.20 Applying a transform on data-items in an interactive Python environment
4.21 Composing projections and transforms together in an interactive Python environment
4.22 Composing projections and transforms in an invalid way, in an interactive Python environment
4.23 Creating a view in an interactive Python environment
4.24 Creating and accessing privacy and write policies in an interactive Python environment
4.25 Graph showing the potential connectivity between categories of web application platforms based on making extensions available from one platform to another.

Chapter 1
Introduction

1.1 Premise

This thesis proposes the Virtual Faraday Cage (VFC), a novel architecture for web application platforms that support third-party extensions. The Virtual Faraday Cage advocates the idea that privacy can be completely preserved while still gaining some utility. The Virtual Faraday Cage allows for fine-grained, end-user control over their own data, forcing third-parties to disclose exactly what data they want and allowing end-users to decide to what extent, if any, that data is to be released. The Virtual Faraday Cage also allows for third-party extensions to view and process private end-user data while ensuring that these extensions are not capable of revealing that information back to the third-party itself. To date, no known proposal exists that provides an architecture that allows for this capability. This is done by combining traditional access control mechanisms, information flow control, and privacy-policy meta-data into interactions between the platform and third-parties. The Virtual Faraday Cage permits web application platforms to incorporate privacy-preservation into their systems – allowing for richer third-party extensions without necessarily making large sacrifices of end-user privacy. The Virtual Faraday Cage can be applied on both centralized and distributed web application platforms (e.g., peer-to-peer), as well as hybrid systems that use a combination of both.
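As a concrete illustration of this mediation, the following minimal Python sketch shows a platform answering a third-party read request by consulting the end-user's privacy settings before any value crosses the trust boundary. The names used here (Platform, PrivacyPolicy, generalize, and so on) are hypothetical and serve only to illustrate the idea; they are not the API specified in Chapter 4.

from dataclasses import dataclass

@dataclass
class PrivacyPolicy:
    visibility: str    # e.g., "none", "owner", "house", or "third-party"
    granularity: str   # e.g., "none", "existential", "partial", or "specific"

@dataclass
class DataItem:
    value: object
    policy: PrivacyPolicy

def generalize(value):
    # Placeholder for a granularity-reducing projection or transform.
    return str(value)[:1] + "..."

class Platform:
    """Mediates every third-party read against the data owner's policy."""
    def __init__(self):
        self.store = {}  # (owner, key) -> DataItem

    def put(self, owner, key, value, policy):
        self.store[(owner, key)] = DataItem(value, policy)

    def read(self, requester, owner, key):
        # In a full system the requester would be checked against
        # per-principal policies; here a single visibility point suffices.
        item = self.store.get((owner, key))
        if item is None or item.policy.visibility != "third-party":
            # A denial looks exactly like the absence of data.
            return None
        if item.policy.granularity == "partial":
            return generalize(item.value)
        return item.value

platform = Platform()
platform.put("alice", "location", "Calgary, AB",
             PrivacyPolicy(visibility="third-party", granularity="partial"))
platform.put("alice", "friends", ["bob", "carol"],
             PrivacyPolicy(visibility="owner", granularity="specific"))

print(platform.read("movie_ext", "alice", "location"))  # "C..." (generalized)
print(platform.read("movie_ext", "alice", "friends"))   # None (not visible)

The important property in this sketch is that a denied request is indistinguishable from a request for data that does not exist, so a third-party cannot use the interface to probe for the existence of information it is not permitted to see.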
While modern third-party extensions are treated as remote entities with few or no local components, the Virtual Faraday Cage restructures these extensions into hybrid systems: extensions with both remote and local code, that could still provide a seamless experience for the end-user. Ideally, the Virtual Faraday Cage should be incorporated 2 into the development of new web application platforms, but could also be implemented via wrappers around existing application platform APIs with minimal changes to existing platform code or work-flows. This thesis provides a theoretical model describing how web application platforms would implement the Virtual Faraday Cage, and specifies an API for third-party extensions. A proof-of-concept implementation validating the Virtual Faraday Cage architecture is also described. With the Virtual Faraday Cage, third-parties can develop extensions to web applications that add new functionality, while still protecting end-users from privacy violations. For example, third-parties might be able to collect the necessary operating data from their users, but be unable to view the social network graph even if it provides features that utilize that graph. In the Virtual Faraday Cage, users can decide how much of their personal information is revealed, if any, to third-parties. The Virtual Faraday Cage enforces a strict information flow policy for an end-user’s data when interacting with third-party extensions. Extensions are split into two components: a remote component and a local one. The remote component of a third-party extension is the third-party’s primary interface with the web application platform, and the only mechanism through which end-user data may be obtained by the third-party. The local component of an extension is one that runs within a sandboxed environment within the web-application platform’s control. This allows for composite third-party extensions that can perform sensitive operations on private data, while ensuring that the knowledge of such data can never be passed back ‘upstream’ to the third-party itself. When a third-party extension is first granted access to an end-user’s data, and during all subsequent interaction – the platform ensures all private data that the extension obtains be either explicitly authorized by the end-user, or authorized by an explicit policy created by the end-user. End-user data would have explicit granularity settings 3 and conform to a subset of the privacy ontology [1] developed by the Privacy and Security group1 at the University of Calgary. This allows the end-user, or an agent acting on their behalf, to weigh the costs and benefits of revealing certain information to the application provider, and also forces the extension provider to specify what end-user data is used and state for what purposes the data is accessed or used. For example, an extension that processes photos on a social network application may have to obtain permission from the user for each access to that user’s photos. Validation of the Virtual Faraday Cage system is aided by a Python-based proof-ofconcept implementation. The proof-of-concept involves a social-network web application platform that allows for third-party extensions to be installed by end-users. The extension implemented is a movie rating and comparison tool, allowing users to save their movie ratings to a third-party, and also compare their movie lists with their friends to determine ‘compatibility ratings’. 
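To make the remote/local split more tangible, the sketch below shows one way such a movie-rating extension could be divided; the class names are assumptions for illustration, not the proof-of-concept's actual code. The remote component receives only what the end-user explicitly releases (ratings keyed by an opaque identifier), while the local component computes compatibility scores over friends' ratings inside the platform's sandbox, where it has no channel back to the third-party.

class RemoteComponent:
    """Runs on the third-party's servers; sees only explicitly released data."""
    def __init__(self):
        self.ratings = {}  # opaque user id -> {movie: rating}

    def save_ratings(self, opaque_user_id, ratings):
        self.ratings[opaque_user_id] = dict(ratings)

class LocalComponent:
    """Runs inside the platform's sandbox; it may read private data such as
    friends' ratings, but it has no network channel back to the third-party."""
    def compatibility(self, my_ratings, friend_ratings):
        shared = set(my_ratings) & set(friend_ratings)
        if not shared:
            return 0.0
        agreements = sum(1 for movie in shared
                         if abs(my_ratings[movie] - friend_ratings[movie]) <= 1)
        return agreements / len(shared)

# The platform invokes the local component on the end-user's behalf and shows
# the resulting score to the user; the score is never sent upstream.
local = LocalComponent()
print(local.compatibility({"Alien": 5, "Heat": 3}, {"Alien": 4, "Up": 5}))  # 1.0

Because the compatibility computation happens locally, the third-party gains the feature without ever observing whose friend lists or ratings were compared.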
To better mirror reality, the third-party will request some demographic information from the end-users (age, gender, location), but can neither request nor be capable of gaining further information (such as the end-user’s friends). 1.2 Organization of this Thesis The rest of this thesis is organized as follows: The remainder of this chapter provides the background and motivations for this thesis, as well as some background needed in privacy, social networks, and other aspects of computer science that are utilized by the Virtual Faraday Cage. Chapter 2 describes related work: privacy work being done in online social networks, as well as more specific works that relate to protecting end-user data from third-parties. Chapter 3 describes the theoretical model which the Virtual Faraday Cage uses and operates within. Chapter 4 describes the architecture, API, 1 The Privacy and Security group is a subset of the Advanced Database Systems and Application (ADSA) Laboratory [2] 4 proof-of-concept, and implementation-specific information. Finally, Chapter 5 concludes this thesis and discusses future work. 1.3 Background & Motivations 1.3.1 Web Applications Web applications are a key part of our daily lives. They range from applications that allow end-users to do online banking, to applications that facilitate social networking. These applications run on a variety of web servers, and the vast majority of them are located remotely from the end-users that utilize their services. In the past, when end-users demanded a new feature for the application they were using, developers had to implement the feature themselves and roll out a new version of their application. As this is a resource-intensive task, many developers have allowed their applications to be extended by using third-party extensions. These third-party extensions, typically installed locally at the end-user’s own risk, would then be able to interface directly with the original application and extend its features beyond what it was originally bestowed. For many web applications, these extensions were installed by the host/provider either as patches on the original web application’s source or as modules that would run with the same rights and capabilities as the application itself. For applications hosted in a distributed environment, where there could be many instances of the application running on many different servers (e.g., web forums), the risks associated with such an extension architecture were not significant enough to warrant radical changes. If a few providers of that application were compromised because of malicious or poorly-written extensions, it would not affect the overall experience for all providers and all end-users. However, with the need for extensions for centralized web applications – where there is a single web site for all those accessing it – this model could 5 no longer work without modification. Here, developers would begin to market their web applications as web application platforms, allowing third-parties to write extensions that would run on top of them. While some require extensions to their web application platform to go through a vetting process [3] , other platforms have a more open model [4]. This new model for extensions allowed the web application platforms to permit the use of third-party extensions, while limiting the direct security risks and resource demands of these extensions. 
Instead, extensions are run remotely – hosted by third-parties – and simply interface with the web application platform’s API to operate. Despite the security advantages for the platform itself, this leads to increased risks to end-user privacy and data confidentiality, as end-user data must be transmitted to (or accessed by) remote third-parties that cannot always be held to the same levels of trust, integrity, or accountability as the original web application provider. 1.3.2 Online Social Networks as a Specific Web Application Platform Online social networks (hereafter referred to as “social networks”) are a recent and immensely popular phenomenon, and they represent a specific class of web applications. First defined as ‘social network sites’ by Boyd and Ellison in 2007 [5], they characterized social networks by rich user-to-user interaction, typically including the sharing of media (photos, videos) and the incorporation of relationships between users on the site (e.g., friends, or group memberships). According to Boyd and Ellison, such sites are defined to be “web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system.” In particular, a social network typically provides web-space for a user to create their own ‘profile’ (e.g., username, hobbies, photos, etc.) and tools to keep in touch with ‘friends’ (other users) through the social network. While the definitive 6 social networks might be services such as Facebook, MySpace, or Google+, many other web applications incorporate varying levels of social network capabilities. For example, Amazon.com has some limited support for user profiles and interaction through reviews and forums – but may not be what typically comes to mind when someone is referring to a social network. A conservative estimate of the total number of registered users of some of the top social networks that operate in North America would put the number at over 579.4 million accounts, with over 285 million active users [6, 7, 8, 9, 10, 11]. In China alone, the social network and instant-messenger provider Téngxùn QQ [12] boasts over a billion registered users, of which, more than 500 million are considered active accounts. Beyond allowing users to keep in touch with current friends through messages, many social networks are used to reconnect with old friends or classmates, maintain professional connections, and keep up with the life events of friends through status updates and planned events. Popular social networks range from being themed for professional use to blogs or anything in between. While “a significant amount of member overlap [exists] in popular social networks” [13], the sheer amount of user accounts and active accounts is a good indicator of their prevalence and popularity. Social networks are an important class of web applications with regards to privacy as they can have millions of end-users and proportional amounts of sensitive and personal data about those users. Because social networks often also provide capabilities for thirdparty extensions, this makes them web application platforms as well. 
While there are other types of web applications that have both sensitive end-user data and potential risks to that data through third-party extensions, the prevalence of social networking makes them an ideal candidate for application of the Virtual Faraday Cage. Thus, social networks are considered a motivating example of a web application platform that would benefit from a system like the Virtual Faraday Cage. 7 1.4 Privacy It is infeasible to try to provide a comprehensive description of privacy within a reasonable space in this thesis, when accounting for historical, political, and legal contexts. Consequently, this section will only provide a brief overview of some of these aspects of privacy, with a focus on how privacy affects individuals in the context of technology and software. Over the past decade there has been much work dedicated to the definition, preservation, and issue of privacy in software systems and on the web. The earliest definitions of privacy, its laws and expectations, trace back to sources including religious texts. The United Nations’ Universal Declaration of Human Rights [14] states in Article 12 that: “No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks.” UNESCO’s Privacy Chair (United Nations Educational, Scientific, and Cultural Organization), states [15] that: “Data privacy technologies are about technically enforcing the above right in the information society. [...] Unless direct action is taken, the spread of the information society might result in serious privacy loss, especially in rapidly developing countries. The lack of privacy undermines most of all other fundamental rights (freedom of speech, democracy, etc.).” As UNESCO points out, privacy helps protect individuals and their rights; protecting them from discrimination or persecution based on information about them that would or should not have been obtained otherwise. While laws can be put in place to punish offenders, it is better to avoid the loss of privacy in the first place as it is impossible to regain privacy once it is lost. 8 1.4.1 Defining and Describing Privacy The Privacy and Security (PSEC) group2 at the University of Calgary, provides “A Data Privacy Taxonomy” [1] to frame the technological definition of privacy. The work presented by the group aims to address the definition of privacy and identify its major characteristics in a manner which can be implemented within policies, software, and database systems. The Virtual Faraday Cage abides by this taxonomy. They identify four distinct actors within the realm of privacy: data providers (endusers), data collectors (web application platform), the data repository (web application platform), and data users (third-party extensions). They also identify four distinct ‘dimensions’ of privacy: visibility (who can see the data), granularity (how specific or general it is), retention (how long can they see it for), and purpose (what can it be used for). In their paper, they propose five main ‘points’ along the visibility-axis within their privacy definition: none, owner (end-user), house (platform), third-party, and all (everyone). Similarly, they propose four main ‘points’ along the granularity-axis within their privacy definition: none, existential, partial, and specific. 
They also have an explicitly defined purpose-axis, as well as a retention dimension that specifies the duration of the data storage and/or under what conditions it expires. In privacy, purpose signifies what a particular piece of data is used for. For example, an end-user’s email address could be used as their account name, it could be used to send them notifications, or it could even be shared with other parties looking to contact that user. Within this taxonomy, purpose represents what a particular piece of data is allowed to be used for. By specifying what purposes data can be used for, this allows for web application platform providers and third-parties to be held accountable for how they use end-user data. Furthermore, it may also be possible to explicitly enforce certain 2 The Privacy and Security group is a part of the Advanced Database Systems and Application (ADSA) Laboratory [2] 9 purposes – for example ensuring that if certain data from a patient database is only for use in a particular treatment, a violator could be detected if that treatment was not followed through after the data was accessed. Typically, purpose is represented as a string, although other structures (e.g. hierarchical structures) can be used as well. Guarda and Zannone [16] define purpose to be the “rationale of the processing [of enduser data], on the basis of which all the actions and treatments have to be performed. [...] The purpose specifies the reason for which data can be collected and processed. Essentially, the purpose establishes the actual boundaries of data processing”. They define ‘consent’ as “[the] unilateral action producing effects upon receipt that manifests the data subject’s volition to allow the data controller to process [their] data” [16]. In the context of the Virtual Faraday Cage, consent means that the end-user has agreed to release their data to be stored and processed by the web application platform. Consent also [usually] implies that the end-user has agreed to do this for a given purpose. Guarda and Zannone also note that under privacy legislation, consent can be withdrawn, and systems implementing privacy should account for this. Obligations are requirements that the receiving party must abide by in order to store or process end-user data. In Guarda and Zannone’s paper, they define obligations as “[conditions] or [actions] that [are] to be performed before or after a decision is made.”. More concretely, if there has been consent from an end-user for a online storefront web application platform to store their email address for the purpose of notifying them of changes in their orders or updates on order delivery – then there is an obligation for that web application platform to do so. Guarda and Zannone’s work suggests that obligations are difficult to define specifically, as they may be described in both quantitative and qualitative terms: an obligation could be based on time, money, or as an order of operations. 10 Retention, as also noted in Barker et al.’s taxonomy [1], refers to how long (or until what conditions) end-user data may be stored. According to Guarda and Zannone, this is explicitly time based [16], however retention could also manifest in a way more similar to an obligation: after the purpose for using a particular data item has been satisfied, that data could be removed. The Virtual Faraday Cage borrows the notions of granularity and visibility from the PSEC taxonomy [1]. 
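Taken together, these dimensions can be viewed as a small policy record attached to each piece of end-user data. The Python sketch below shows one possible encoding of such a record; the field names, value sets, and example values are assumptions made for illustration rather than the taxonomy's normative representation, and purpose and retention are carried only as descriptive metadata here, anticipating the discussion below of why the Virtual Faraday Cage treats them as unenforceable.

from dataclasses import dataclass
from typing import Optional

VISIBILITY = ("none", "owner", "house", "third-party", "all")
GRANULARITY = ("none", "existential", "partial", "specific")

@dataclass
class TaxonomyPolicy:
    visibility: str            # who may see the data
    granularity: str           # how specific the released value may be
    purpose: str               # what the data may be used for (informational only)
    retention: Optional[str]   # how long it may be kept, e.g. "30 days", or None

    def __post_init__(self):
        assert self.visibility in VISIBILITY, "unknown visibility point"
        assert self.granularity in GRANULARITY, "unknown granularity point"

# Example: an email address stored by the platform for notification purposes.
email_policy = TaxonomyPolicy(visibility="house",
                              granularity="specific",
                              purpose="send order and account notifications",
                              retention="until account deletion")

Such a record gives the Virtual Faraday Cage what it needs in order to decide, per data item and per requester, what visibility and granularity to apply at request time.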
Further, it addresses all the points along the granularity axis by using projections and transforms, and by using the properties of the Virtual Faraday Cage API, which explicitly prohibits testing for the existence of data as a way of revealing information. Visibility is addressed within the Virtual Faraday Cage, although no distinction is made between ‘third-party’ and ‘all/world’ due to security and collaborative adversary concerns. Limited support for the purpose axis exists within the Virtual Faraday Cage, however it is considered unenforceable and exists purely to help inform the end-user. Retention is omitted and considered unenforceable because the Virtual Faraday Cage cannot police a third-party to ensure that all copies of end-user data are deleted when they are promised to be. However, storing retention requirements could be easily done within the Virtual Faraday Cage. Both purpose and retention are considered unenforceable by the Virtual Faraday Cage because of the assumption that a third-party cannot be trusted and that there would be no software or algorithmic method of simultaneously providing a third-party machine with end-user data, and being capable of ensuring that the data was only used for the specific purposes claimed, and that this data was properly erased after the retention period ends. 11 1.4.2 Laws, Business, and the Value of Privacy The value of protecting privacy has long been felt by the privacy and security community – however, while the value of using private data to further business interests is obvious, the value of protecting privacy has evidently not been considered as valuable by the business community. One of the first papers to make a direct and articulated argument for the protection of privacy within software from a monetary perspective was by Kenny and Borking [17], articulating the value of privacy engineering at the Dutch Data Protection Agency. In this paper, the authors begin by arguing why privacy makes sense as a bottom-line investment within software products. Among those arguments lie customer trust as well as legal and compliance requirements that currently exist as well as those that are likely to develop in the future. Kiyavitskaya et al. [18], highlighted the importance of legal compliance requirements as well as the difficulties associated with developing systems and applications that are compliant. Taken into the wider context of designing privacy-aware systems, their work would seem to suggest that flexible privacy-aware systems are desirable in that they could be adapted to fit current as well as potentially future privacy and legal requirements. Thus, developing these systems would also reduce the amount of work needed when future privacy legislation becomes enacted. In the current environment of cloud computing, the growing significance of various national and international privacy laws becomes relevant to businesses seeking to grow or connect with customers internationally [19]. Fundamentally, it is easier and more cost effective to build privacy and security into software and systems during their development rather than attempting to adjust or change things afterward. By using privacy engineering in the design and development process, a corporation can avoid the potential of future incurred costs and expand their legal capabilities of operating internationally. 12 Privacy violations can both destroy customer trust and result in significant legal damages and fees, making proactive privacy engineering even more attractive. 
In 2011, the first financial penalty was enacted on a Maryland healthcare provider for $4.3 million in damages [20] for being in violation of the Health Insurance Portability and Accountability Act (HIPAA). This serves to further illustrate the importance of privacy law compliance and a comprehensive understanding of privacy when building a system in the first place. In Canada, the Personal Information Protection and Electronic Documents Act (PIPEDA) [21] was enacted into law in 2000. PIPEDA’s purpose was to “establish [...] rules to govern the collection, use and disclosure of personal information in a manner that recognizes the right of privacy of individuals with respect to their personal information and the need of organizations to collect, use or disclose personal information for purposes that a reasonable person would consider appropriate in the circumstances”. PIPEDA covers both medical and health-related information, as well as personal information. Health-related information about an individual (whether living or deceased) is defined as: “(a) information concerning the physical or mental health of the individual; (b) information concerning any health service provided to the individual; (c) information concerning the donation by the individual of any body part or any bodily substance of the individual or information derived from the testing or examination of a body part or bodily substance of the individual; (d) information that is collected in the course of providing health services to the individual; or (e) information that is collected incidentally to the provision of health services to the individual.” 13 Personal information is defined as “information about an identifiable individual”, though this specifically excludes a person’s name, title, address, and phone number at a business. PIPEDA establishes that organizations must have valid and reasonable situationallydependent purposes for collecting personal information, and that the collection of personal information must be done with an individual’s consent. The only exception to this would be situations where the collection is “clearly in the interests of the individual and consent cannot be obtained in a timely way”, situations where the collection is “solely for journalistic, artistic, or literary purposes”, situations where the information is public, or situations where the collection is otherwise mandated by law. PIPEDA also establishes limited situations when an organization can utilize personal information without the consent of the individual – in particular, during emergencies that threaten the individual in question, or if the organization has reasonable grounds to believe that the individual may be breaking domestic or foreign laws (and the information is used for the purpose of investigation), or if the information is used for statistical or scholarly purposes where anonymity is ensured and the Privacy Commissioner is notified before the information is used. Additionally, organizations can utilize personal information without explicit consent where this information is publicly available, or if the information was collected as required by law or is clearly in the benefit of the individual and consent cannot be obtained in a timely fashion. Otherwise, PIPEDA compliance requires that individuals give consent to the use of their personal information. 
PIPEDA mandates that, with few exceptions (e.g., information made to a notary in the province of Quebec), an organization may only disclose personal information without the consent of the individual if: 1) such disclosure is required by law, 2) if the disclosure is made to a government entity because the organization has reasonable grounds to suspect that the individual in question may be in breach of laws or that the information may 14 relate to national security, 3) if the disclosure is made to protect the individual in the event of an emergency, 4) if the disclosure is made for scholarly or statistical studies that would otherwise be unachievable without the disclosure, 5) if the disclosure is made to a historical or archival organization and the purpose of the disclosure was for the preservation of the information, 6) made either after 100 years, or 20 years after the individual has died – whichever is earlier, or 7) if the information was publicly available. PIPEDA compliance also requires that organizations make available to an individual their personal information records, so long as either the individual’s health or safety is threatened by withholding or such information is severable from third-party information or if the third-party consents to such disclosure. Individuals must make these requests in writing, and PIPEDA requires that organizations follow through with these requests and provide assistance (if necessary) in a timely fashion – either making these records available to the individual or otherwise notifying them of the need for a time extension within thirty days after the date of the request. Additionally, organizations must be able to provide alternative formats for personal information records, e.g., to those with sensory disabilities. Finally, an individual also has the right to request that organizations inform the individual about any disclosures of their information to government and law enforcement entities (e.g., under a subpoena, request for information, etc.) – but the organization must first notify the government entities immediately and in writing before any response is made back to the individual in question, and the organization must get authorization from the government entity first. PIPEDA also grants exceptions to requiring the granting of access to an individual’s information to any parties in situations where “the information is protected by solicitorclient privilege, or where doing so would reveal confidential commercial information, or where doing so could reasonably be expected to threaten the life or security of another individual, or where the information was generated in the course of a formal dispute 15 resolution process, or where the information was created for the purpose of making or investigating a disclosure under the Public Servants Disclosure Protection Act.” Should an individual feel that an organization is violating or ignoring a recommended course of action defined in one or more parts of PIPEDA, that individual can write a written complaint addressed to the Privacy Commissioner. A complaint that results from the organization refusing to grant an individual access to their personal information must be filed within six months after the refusal or after the time-limit for responding to the request. Afterward, if the Commissioner believes there are reasonable grounds to investigate an organization, the Commissioner may escalate the complaint and begin an investigation. 
Alternatively, assuming that the complaint was filed in a timely manner, if the Commissioner believes that the individual should first exhaust other reasonable procedures or believes that the complaint should be dealt with under more suitable laws, then the Commissioner will not conduct an investigation. In the course of an investigation, PIPEDA allows for the Privacy Commissioner to audit the personal information management policies of organizations attempting to comply with the law. This may be done by the Commissioner directly, or under certain circumstances, or officers and employees to whom the auditing is delegated to. The Commissioner can also summon and enforce the appearance of organization representatives and force these representatives to give and produce evidence and testimony under oath on matters that the Commissioner considers necessary to investigate the complaint. The Commissioner can also, at any reasonable time, enter any commercial premises of the organization it is investigating and converse in private with any person on the premises as well as examine or obtain copies of records and evidence obtained on premises in relation to the ongoing investigation. 16 Should the Privacy Commissioner determine, during the investigation, that there is insufficient evidence, or that the complaints are either trivial, frivolous, vexatious, or made in bad-faith, the Commissioner may choose to suspend the investigation. Similarly, should the organization in question provide a fair and reasonable response to the complaint, or if the matter is already under investigation or already reported on, the current investigation may be suspended. Following an investigation, complaints may be escalated to dispute resolution mechanisms, and the Commissioner will issue a public report within a year after the complaint is filed. This report will include the Commissioner’s findings and recommendations, any settlements that were reached by the parties, and, where appropriate, a request that the organization in question notify the Commissioner (within a specified time frame) of any actions that they will take to address the Commissioner’s findings and recommendations, or rationale as to why they did not. Following the report publication, the individual may apply for a court hearing regarding any matter of which the complaint has been made. Additionally, the Commissioner may, with the individual’s consent, appear on behalf of the individual in a court hearing. The court may then order the organization to alter its practices to comply with the Privacy Commissioner’s recommendations, publish a notice of any action(s) taken or to be taken, and award damages to the individual – though the court is not limited to these actions. While PIPEDA violations do not inherently carry administrative monetary penalties directed at the organization in question, there are individual fines of up to $100,000 CAD for any individual who obstructs the office of the Privacy Commissioner in the course of an investigation and audit. Additionally, the results of both receiving public scrutiny, potential disruptions to business practices as investigations may be carried out on premises, as well as potential criticism and poor public image are all negative effects that many businesses will benefit from avoiding. 
As PIPEDA also makes assurances to 17 whistle-blowers and employees of the organization in question, relying on the secrecy of business practices to avoid scrutiny is also a poor method to avoid a potential PIPEDA investigation. Furthermore, future changes to laws may bring about monetary penalties to organizations for violations of PIPEDA or equivalent laws – so proactive compliance may be an economical option. PIPEDA places the importance of health and safety as paramount directives above all other compliance requirements – with the potential exception of an individual’s request for information. PIPEDA also makes provisions for the use of electronic written documents rather than necessitating physical documents, and it establishes standards and guidelines for doing so. It also establishes that an organization must be responsible for personal information under its control, and that it should designate one or more individuals who are responsible for that organization’s compliance with PIPEDA’s principles, the core of which are: 1) Accountability, 2) Identifying Purposes, 3) Consent, 4) Limiting Collection, 5) Limiting Use, Disclosure, and Retention, 6) Accuracy, 7) Safeguards, 8) Openness, 9) Individual Access, and 10) Challenging Compliance. Recently, the Office of the Privacy Commissioner of Canada published a report [22] regarding an on-going complaint [23] filed initially in 2008 by the Canadian Internet Policy and Public Interest Clinic (CIPPIC) under PIPEDA against Facebook. Within the findings, the report stated: “The complaint against Facebook by the Canadian Internet Policy and Public Interest Clinic (CIPPIC) comprised 24 allegations ranging over 11 distinct subjects. These included default privacy settings, collection and use of users’ personal information for advertising purposes, disclosure of users personal information to third-party application developers, and collection and use of non-users personal information. 18 [...] The central issue in CIPPIC’s allegations was knowledge and consent. Our Office focused its investigation on whether Facebook was providing a sufficient knowledge basis for meaningful consent by documenting purposes for collecting, using, or disclosing personal information and bringing such purposes to individuals attention in a reasonably direct and transparent way. Retention of personal information was an issue that surfaced specifically in the allegations relating to account deactivation and deletion and non-users personal information. Security safeguards figured prominently in the allegations about third-party applications and Facebook Mobile. [...] On four subjects (e.g., deception and misrepresentation, Facebook Mobile), the Assistant Commissioner found no evidence of any contravention of the Personal Information Protection and Electronic Documents Act (the Act) and concluded that the allegations were not well-founded. On another four subjects (e.g., default privacy settings, advertising), the Assistant Commissioner found Facebook to be in contravention of the Act, but concluded that the allegations were well-founded and resolved on the basis of corrective measures proposed by Facebook in response to her recommendations. On the remaining subjects of third-party applications, account deactivation and deletion, accounts of deceased users, and non-users personal information, the Assistant Commissioner likewise found Facebook to be in contravention of the Act and concluded that the allegations were well-founded. 
In these four cases, there remain unresolved issues where Facebook has not yet agreed to adopt her recommendations. Most notably, regarding third-party 19 applications, the Assistant Commissioner determined that Facebook did not have adequate safeguards in place to prevent unauthorized access by application developers to users personal information, and furthermore was not doing enough to ensure that meaningful consent was obtained from individuals for the disclosure of their personal information to application developers.”3 Since then, Facebook has made changes to their privacy policies and practices, including the addition of better default privacy settings and a “privacy tour” feature – however despite these changes, CIPPIC filed another complaint [24] in 2010 expressing their dissatisfaction with Facebook’s response and indicating that they felt that many core concerns they had were not addressed by these changes – including the lack of support for fine-grained and granular control over end-user data when shared with third-parties. PIPEDA is only one example of privacy legislation – Europe’s Data Protection Directive [25], Directive of Privacy and Electronic Communications [26], and recently passed General Data Protection Regulation [27] (effective 2016), along with the United Kingdom’s Data Protection Act [28] all exist as examples of important privacy regulation. While the United States has the Privacy Act of 1974 (and its amendment, the Computer Matching and Privacy Act of 1988) [29], as well as the Electronic Communications Privacy Act of 1986 [30], both laws can be criticized for being “out-of-date” and lax in comparison to existing and new laws in Europe and Canada. Despite this, the nature of future US laws may change, and organizations and companies seeking to do business with Canada and Europe will need to earn their trust by providing a higher standard for end-user privacy. Consequently, there is an existing, and growing, body of legal directives and regulations that should motivate parties to ensure that privacy protection is built into their systems and practices. 3 [22], pg. 3 20 Building privacy into applications and systems is advantageous from a sales perspective as well. A product with built-in ‘privacy’ can be more attractive to end-users or customers, and could potentially fulfill applicable to more business use-cases than the same product without privacy features. Additionally, a corporation that establishes a ‘brand’ that consistently uses privacy engineering in their products and cares about customer (or end-user) privacy becomes a ‘trusted brand’. A trusted brand gains additional customer loyalty, the ability to charge more for their products, and also is more capable of protecting itself against gaining a negative reputation. All of this ultimately means that protecting privacy is good for the bottom line. Kenny and Borking also cover more aspects of privacy and businesses in their paper [17], discussing privacy-related legislature around the world before describing a framework for software engineering (for software agents) that embeds privacy risk management into the actual development process. Guarda and Zannone [16] also address privacy laws and regulations, and how they are affecting (and should affect) software. This paper was briefly discussed in the beginning of this chapter with regards to defining and describing privacy. 
First and foremost, their work is an in-depth examination of the legal landscape in Europe and the United States with regards to privacy legislation. They also define what privacy-aware systems should incorporate – for example, purpose, consent, obligations, etc. In their paper, Guarda and Zannone mirror Kenny and Borking [17], pointing out some of the benefits and penalties of abiding by or breaking privacy regulations. The remainder of Guarda and Zannone’s work describes privacy engineering, privacy policies, privacy-aware access control, and other aspects of privacy technology and legislature as they relate to software and systems design and development [16]. While the Virtual Faraday Cage was not designed explicitly to fulfill privacy legislature requirements, using the Virtual Faraday Cage may assist web application platform providers in conforming to such legislature because of the ability to 21 fine tune privacy settings and restrict a third-party’s access to end-user information. Gellman’s [19] work on privacy and confidentiality considers risks in the “clouds”, he defines cloud computing as the “sharing or storage by [end-]users of their own information on remote servers owned or operated by others.”. He goes on to define a ‘cloud service provider’, which “may be an individual, a corporation, or other business a non-profit organization, a government agency or any other entity. A cloud service provider is one type of third-party that maintains information about or on behalf, another entity”. Gellman’s report primarily considers a cloud provider’s terms of service and privacy policy as well as applicable laws (primarily in the United States). From his point of view, all parties are considered to be honest, that is to say, they abide by their own terms of service and privacy policies. Even with this assumption, many risks to end-users still exist, for example the fact that legal protections for end-users may cease for users that utilize cloud services – alternatively, an end-user may also inadvertently break the law by using a cloud provider. The Virtual Faraday Cage assumes that the use of a given web application platform is legal and that the platform is well-behaved. While Gellman defines all cloud providers as third-parties, the Virtual Faraday Cage does not consider all cloud providers necessarily as third-parties in all situations. Additionally, for the purposes of this thesis, thirdparties are not considered necessarily honest, and the threats to privacy are considered only from third-parties providing extensions to that web application platform. Gellman’s work reinforces the viewpoint that the less a third-party knows to perform its functions, the better. 22 1.5 Social Networks As stated earlier in this chapter, social networks are a recent and popular phenomenon with millions of users. Social networks also contain meaningful and accurate data regarding their end-users, and there are many motivated parties that are or would be extremely interested in acquiring this data. Unfortunately, social networks also have inherent risks and dangers – both from within and outside the social network. Because of the immense popularity, value, and dangers of social networks, they are excellent and motivating examples of web application platforms that have high-stakes for third-parties and for millions of end-users’ personal and private information. This is why the Virtual Faraday Cage utilizes a social network web application as its implementation proof-of-concept. 
This section examines the evidence and arguments for these claims. 1.5.1 The Value of Social Network Data Social networks are full of meaningful and accurate data [31]. Google+, Facebook, MySpace, LinkedIn, Bebo, LiveJournal, etc. all support profile pictures, if not personal photo albums - where photos can usually be “tagged” to specify who else is in a picture and what they look like. MySpace profiles can contain such potentially sensitive information as sexual orientation, body type and height, while Facebook profiles will typically contain at least a user’s real name. Google+, Facebook, MySpace, and LinkedIn profiles can contain information about past and ongoing education and work experience - as well as location information. Because the very purpose of a social network is so that users can keep up with their friends and also keep their friends up to date about themselves, the data kept on them can be assumed to be current. Consequently, this makes the potential incentives for obtaining end-user data very high, and the potential consequences of data leakage very severe. 23 Many social network providers make no secret that they are generating revenue from advertising through their system. Both Facebook [32] and MySpace [33] allows advertisers to target their audience by attributes such as geographic location, age, gender, education, relationship status, and so on. Targeted advertising is not a new phenomenon and has existed for decades. It is also one of the more effective marketing measures available. In the past, individuals (or corporations) known as “list brokers” would produce a list of people of interest for a price so that a salesperson or company would be able to market a product or service to a more interested audience. While this list could be something as ‘innocuous’ as comprising of people living within a certain area, the lists could include such things as individuals suffering from a certain disease (e.g., asthma). Considering the richness of data and the relative ease of availability, one must presume that advertisers are very interested in using social networks for targeted ads. The first paper to identify social networks as an extremely effective venue for marketing was by Domingos and Richardson [34] in 2001. They point out that the knowledge of the underlying social network graph is of great value and usefulness in marketing. By utilizing knowledge of the underlying network structure, marketers could become more effective – for example, spending more money on ‘well-connected’ individuals and less on others. They postulated the theory that well-connected customers have a stronger influence on their peers, which in turn leads to higher sales. In essence, the common sense that if a celebrity supports a particular product, others will buy it too has been formally stated by these authors. Domingos and Richardson essentially set up the premise that social networks, and their underlying graph structure in particular, are valuable. In addition to targeted advertising, viral marketing (or “network marketing”) may also be of strong interest to advertisers [34, 35], since this allows them to reach a wide audience while targeting only a small number of individuals. If advertisers were able to send targeted ads to ‘influential’ consumers (e.g., those with a large number of friends), 24 they might be able to reach a wider audience for a much lower cost when those influential consumers re-share or post the advertisement. 
As such, the structure of a social network would likely be of significant interest to advertisers as well. Malicious parties are also interested in social networks: evident through the number of worms, phishing scams, and account hijackings that have taken place on sites such as Facebook and MySpace [36, 37, 38, 39, 40]. Additionally, social-predators have also been shown to be interested in social networks and have used them in the past for their own purposes such as stalking [41, 42, 43, 44, 45]. Jagatic et al. [46] reaffirms this by demonstrating a very effective attack on end-user authentication credentials by using social network topology information. ‘Phishing’[47] is a type of security attack that manipulates an end-user into believing that the attacker is actually someone else (e.g., a bank or email provider), and lures the end-user into ‘logging-in’ with their account credentials. In this paper, Jagatic et al. demonstrated that the use of a social network’s topology greatly increased the effectiveness of phishing scams. In their experiment, the authors ran two tests – one with random individuals contacting others to lure them to a fake website, and another where the individuals were asked to go to that website by people they knew. The authors found that ‘social phishing’ increased the effectiveness of phishing by up to 350% over a traditional phishing scam. Insurance companies also would likely find social network data to be very useful for their current and future needs. Individuals who have photos of rowdy drunkenness, update their social network status via mobile device while claiming to be driving, or are a member of an “alternative lifestyle” may be of interest to these companies when deciding on what rates to set for health, auto, and other insurance policies. In particular if insurance companies will become as prevalent as suggested by the Discovery channel’s documentary “2057” [48], then they may very well want to know everything about you, something that a social network can help provide. For example, one woman has ap- 25 parently lost her insurance benefits as a result of photos posted on Facebook [49]. The founder and editor-in-chief of Psych Central [50] wrote a 2011 article cautioning against anyone discussing their health or well-being on social networks because such information could be used against them, possibly wrongly, to deny them their health or insurance benefits. In a Los Angeles Times [51] article, a senior analyst for an insurance consulting firm was interviewed: “Mike Fitzgerald, a Celent senior analyst, said life insurance companies could find social media especially valuable for comparing what people will admit about lifestyle choices and medical histories in applications, and what they reveal online. That could range from ‘liking’ a cancer support group online to signs of high-risk behavior. ‘If someone claims they don’t go sky diving often, but it clearly indicates on their online profile that they do it every weekend they can get away,’ Fitzgerald said, ‘that would raise a red flag for insurers.’ Social media is ‘part of a new and emerging risk to the insurance sector’ that could affect pricing and rating of policies in the future, said Gary Pickering, sales and marketing director for British insurer Legal & General Group. 
But many insurance lawyers decry such practices and warn of a future when insurance companies could monitor online profiles for reasons to raise premiums or deny claims.” Policy enforcers, such as school officials, an employer, or law enforcement, are also interested in social network usage and data. Photographs showing individuals doing illicit activities (speeding [52] or drugs [53], for example) could be used as evidence to aid in prosecution by law enforcement - photographs of a party involving alcohol on a dry campus could be used as evidence for disciplinary action by university officials [54, 55, 56, 57] - and similarly, leaks regarding confidential work-related topics could be 26 used for disciplinary action by one’s employers. Gross and Acquisti’s [58] work examines the behavior of people within social networks. Specifically, the authors examined the profiles of 4,000 Carnegie-Mellon students to determine the types of information they disclosed to Facebook and their usage of privacy settings. More than 50% of the profiles revealed personal information in almost all categories examined by their study (profile image, birthday, home town, address, etc.). Furthermore, when examining the validity of profile names and identifiably of profile images, they found that 89% of the profiles examined appeared to have real names, and 55% appeared to have identifiable images. Since then, public backlashes have followed Google’s [59, 60] and Facebook’s [61, 62, 63] respective decisions to publicize data that users had expected to be private, providing evidence that privacy is becoming a more important issue for the general public. However, despite this evidence that the general populations’ awareness of privacy issues has increased, the study serves to reinforce the idea that social networks continue to be a rich source of accurate and detailed information on the majority of their end-users. Gross and Acquisti also published a second paper [31] that conducted a survey of students at their university regarding the students’ use of Facebook and their attitudes towards privacy. The authors then compared the survey results with data collected from the social network before and after the survey. In the survey, questions such as “How important do you consider the following issues in the public debate?” and “How do you personally value the importance of the following issues for your own life on a dayto-day basis?” were asked, with respondents filling in their responses to subjects such as “Privacy policy” on a 7-point scale. Privacy policies ranked 5.411 and 5.09 on the scales for public debate and personal life, respectively, which was ranked higher than ‘terrorism’ in both cases. Note that in all categories, subjects ranked the importance of public debate of privacy higher than personal importance on average. 27 It should also be noted that over 45% of those surveyed rated the highest level of worry to “A stranger knew where you live and the location and schedule of the classes you take”, as well as 36% for “Five years from now, complete strangers would be able to find out easily your sexual orientation, the name of your current partner, and your current political views”. For both scenarios, the average ratings were above 5 (5.78 and 5.55, respectively). For non-members of Facebook, the survey revealed that the importance of ‘Privacy policy’ was higher: 5.67 for non-members versus 5.3 for members. 
After using statistical regression on their data and accounting for age, gender, and academic status (undergrad, grad, staff, faculty), the authors discovered that an individual’s privacy attitude can be a factor in determining membership status on Facebook. Specifically, this factor is statistically significant within non-undergraduate respondents, but does not appear to be statistically significant among undergraduates – even those with high privacy concerns. Furthermore, their study found that the information provided to Facebook, when provided, was overwhelmingly accurate for every category of personal information provided (birthdays, home number, address, etc.) over 87% of the survey takers who provided such information stated that the information provided was accurate and complete. The authors also state that, “... the majority of [Facebook] members seem to be aware of the true visibility of their profile - but a significant minority is vastly underestimating the reach and openness of their own profile.” The authors also discovered that “33% of [the] respondents believe that it is either impossible or quite difficult for individuals not affiliated with an university to access [the Facebook] network of that university... But considering the number of attacks ... or any recent media report on the usage of [Facebook] by police, employers, and parents, it seems in fact that for a significant fraction of users the [Facebook] is only an imagined community.” Finally, examining profiles post-survey revealed no statistically significant change between the control group and 28 the experimental group, suggesting that the survey had little effect on an individual respondent’s privacy practices on Facebook. Acquisti and Gross’ studies, along with that of Jagatic et al., and Domingos and Richardson, all serve to reinforce the motivations of third-parties to use social network information, as well as the importance of keeping such information on end-users private. The use of private information by third-parties can reap immense benefits – both for ‘legitimate’ businesses as well as those purely malicious. On the other hand, end-users do value privacy immensely, even if in practice many may choose a less privacy-conscious service in exchange for certain features unavailable otherwise. Additionally, services that incorporate greater control of end-user privacy may benefit from increased adoption (or less rejection) by more privacy-conscious end-users, a point that was also argued in the previous section. 1.5.2 Innate Risks, Threats, and Concerns Risks and threats to end-users of social networks range in scope and severity as well as the vectors through which they can affect end-users. A large body of work [58, 31, 46, 64, 65, 66] has been published discussing the various risks and dangers associated with social networks. While some threats are innate to the nature of online social networks, or risks associated with security breaches of such sites – other threats come from other users or third-party extensions that end-users utilize. While social networks such as Facebook may require these third-parties to provide privacy policies [67], there is no way to ensure compliance or to what standards private data is protected. Publicized instances of risks that end-users have faced from the revelation of their data to other parties range from job loss to criminal prosecution, and in more extreme cases, stalking and death. 
Additionally, other reports have indicated that end-users could be denied a job application or a school admission on the basis of information obtained from social networks. Some of the end-user data can be obtained from publicly viewable pages for the members of many social networks, which in many cases may contain more information than their members would prefer to have listed [66]. Some private data may also be accessible through alternative channels [66] or be leaked through friend (or friends-of-friend) connections or by individuals who have had their accounts hijacked [38, 39, 40]. Data can also be leaked through third-party extensions [68, 69, 65], and potentially even the social network itself [70]. In their study of a sample of Carnegie-Mellon University students on Facebook, Gross and Acquisti showed that 7% of the sampled women and 14.5% of the sampled men were vulnerable to real-world stalking, while 77.7% of the students were vulnerable to online stalking. Other threats that have been alluded to in previous research work [58] include the potential for social networks to be used as aids for identity theft. Numerous stories have hit the news of banks and employees losing laptops with unencrypted ‘data’ – given the richness of the data on social networks, this could easily lend a helping hand in performing such theft: social networks may include such data as a date of birth and hint at a person’s birth city, parents’ names, or other information. While the number of publicized incidents remains in the extreme minority when compared to the massive number of users, their severity should motivate further safeguarding of end-user data. These risks and concerns help illustrate why social networks are an excellent and motivating example of web application platforms with third-party extensions that need better privacy protection for end-users. Rosenblum [64] presents an overview of the various threats that social networking sites pose to end-users. Rosenblum calls social networking sites ‘digital popularity contests’ where “success [is] measured in the number of unsolicited hits to one’s page”, but contends that social networks such as Facebook are “much truer to the model of [real-life] social networking”. In one social networking site, MySpace, a user reportedly stated to the New York Times that she would accept any and all friend requests that she receives. Rosenblum notes that ‘stealth advertisers’ are also utilizing social networks as a form of advertising, creating profiles for self-promotion and connecting with end-users to further that goal. In fact, and more disturbingly, MySpace users have utilized automated friend-requesting scripts to try and maximize their number of friends [71]. Rosenblum also highlights that end-users of such social networks often have a presumption of privacy that does not really exist. Consequently, what they say or post on their profiles (images of underage drinking or drug usage are ‘commonplace’) can drastically affect their future employment or even academic careers. Furthermore, this is not limited to just illegal behavior or activities, but also extends to other aspects of an individual’s presumed private life. Examples include making unflattering comments about one’s employers or customers [72], or having a personal profile with ‘unacceptable content’ while employed as a teacher [73].
There are numerous other examples ranging from likely poor judgment on the part of the end-user [74] to seemingly overly-harsh responses by employers [75]. Two security and privacy risks are identified by Rosenblum: ‘net speech and broad dissemination’ and the ‘unauthorized usage [of end-user data] by third-parties’. He notes that unlike real-world situations (e.g., with friends at a bar), speech on a social network is stored, harvested, and analyzed – and is more akin to “taking a megaphone into Madison Square Garden each time [you] typed in a message.” Rosenblum continues, stating that: “[Users] are posturing, role playing, being ironic, test-driving their new-found cynicism in instantaneously transmitted typed communications or instant messages. And all this on a medium that does not record irony [...]. The valence of language that allows tone to control meaning is lost. [...] [And] as the media has learned with sound bites, limiting the context of an utterance can radically distort its meaning. [...] What these social networks encourage is a culture of ambiguous signifiers which the reader is left to interpret. If a reader happens to be a hiring officer, this can have disastrous results.” Because the records of our online interactions are often permanent, the safeguarding of such data is all the more important. Rosenblum also notes that corporations routinely utilize search engines to do background checks on prospective employees and often review online social networks to see what these users post online. Beyond employers, Rosenblum also acknowledges the threat of marketing firms that seek to gain access to social networking sites and their data. An online social network purchased by another company (e.g., MySpace purchased by News Corp.), or a telecommunications carrier that alters its privacy policy to state that it owns the digital content of email traffic, could result in disastrous ramifications for end-user privacy. As Rosenblum states, “[News Corp.] could claim ownership of and exploit the content of MySpace, either using personal information in any way it saw fit or selling the right to use it to others.” Worse yet, there have been many documented cases of sexual predators (as Rosenblum states) or even murders committed where the victims were picked, discovered, or otherwise stalked with the aid of social networking sites [41, 45]. Wu et al. [76] analyzed the privacy policies of several social networks (Facebook, LinkedIn, MySpace, Orkut, Twitter, and YouTube) and linked them back to the PSEC Privacy Taxonomy [1] developed earlier by Barker et al. The authors also extended the category of visibility to ‘friends’ and ‘friends-of-friends’, differentiating between third-parties outside of the social network and other users inside the social network. The authors also separated end-user data into different categories: ‘registration’ (personally identifiable and unique across the entire social network), ‘networking’ (friends or contacts), ‘content’ (end-user data, profile information, etc.), and ‘activity’ (web server logs, cookies, and other information). All of the social networks analyzed would reserve the right to use the collected data for any purpose. Visibility for registration data was confined to the social network itself; however, network and content data was visible at least to ‘friends’ and at most to the entire world (or anyone registered on the social network).
Activity data, on the other hand, was visible both to the social networks as well as, in four out of the six examined social networks, third-parties. The granularity for all data categories was almost always specific, the exceptions being LinkedIn and Twitter, which would use aggregate activity data. Finally, retention is not mentioned in any of the examined privacy policies and can be assumed to be indefinite, with the exception to legal compliance issues (e.g., end-user under the age of 13). Bonneau et al. [66] examined how private or sensitive data could be obtained from an online social network without end-users’ knowledge. This paper primarily examines how data can be obtained from social networking sites such as Facebook, and how such data leaks out in the first place. The authors demonstrate how such data can be obtained: through public listings, creation of false profiles (spies) on the network, profile compromise and phishing, and malicious extensions to the web application platform. Additionally, limitations of the Facebook Query Language (or more generally: web application provider APIs) can also leak sensitive information. In particular, they demonstrated how the Facebook Query Language returning the number of records in a query is a form of leaking some information. Proofpoint Inc. highlighted concerns of businesses and corporations with regards to loss of private or sensitive information through email and other means, including social networks [77]. Their report was based off a survey they conducted across 220 US-based corporations which each had over 1000 employees. The report showed that over 38% employed people to monitor outbound email content, and over 32% had employees whose exclusive duty was to monitor outbound email content. Nearly half of the corporations 33 surveyed with over 20,000 employees had people who monitored outbound email content. While these numbers are for outbound email, the survey also discovered that between 40-46% of the firms were concerned with blogs, social networks, and similar activities. They found that over 34% of corporations said that their business was affected by information released through some means (email, social networking, etc..), and that over 45% of businesses had concerns over the potential leakage of information through social networking sites in particular. At least two specialized companies exist4 that operate on social networks in efforts to combat ‘bad PR’ for its clients - no doubt such businesses are interested in data stored on social networks as well. While [confidential] data loss prevention is outside the scope of this thesis, this report further highlights the very real necessity of improving end-user privacy to prevent the inadvertent leakage of sensitive information. 1.6 Security While privacy and security are dependent fields, this section will consider aspects of security unrelated to privacy. Because of the vast nature of the field, this section will only provide a limited overview of selected topics in security as they pertain to the Virtual Faraday Cage. The Virtual Faraday Cage makes use of topics in security such as access control, information flow control, and sandboxing so this section covers the necessary background. 4 Reputation.com and Zululex [78, 79] 34 1.6.1 Access Control and Information Flow Control Access control is an aspect of computer and information security that deals with models and systems for controlling access to resources. 
Information flow control is an area of information theoretical research that is concerned with the leaking or transmission of information from one subject to another within a system. Denning [80] first defined it as: “Secure information flow,’ or simply ‘security,’ means here that no unauthorized flow of information is possible.” The “principle of least privilege” is a well known and established guiding principle in the design of secure systems. First proposed by Saltzer [81] in 1974, it is described as a rule where, “Every program and every privileged user of the system should operate using the least amount of privilege necessary to complete the job.” The Principle of Least Privilege ensures that the impact of mistakes or the misuse of privileges are confined. The Virtual Faraday Cage conforms to these principles and avoids the pitfalls of granting third-party extensions access to overly-powerful APIs. Instead, third-party extensions are only capable of being granted access to an intentionally simple API, one with fine-grained access controls that facilitate conforming with the principle of least privilege. One of the most famous access control models that implements a rudimentary form of information flow control was the Bell-LaPadula model [82]. The original model was intended for use in government systems where classification and declassification of information, and the access to such information needed to be regulated. In their model, there existed four levels of classification: unclassified, confidential, secret, and top-secret. Subjects at a given classification level could not ‘write down’ – that is, write to data that was at a lower classification – nor could they ‘read up’ – that is, read data from a higher classification. Their model also had support for labels in the form of categories – for example a security level X could represent ‘top-secret: NASA, USAF’, indicating that only someone with an equivalent or ‘dominating’ clearance (e.g., ‘top-secret: NASA, 35 NATO, USAF’) could access such documents. Decentralized Information Flow Control, first proposed by Myers and Liskov [83], is an information flow control model that makes use of multiple principals and information flow labels to control how information is used and disseminated within an application. In their model, each owner of a particular data item can choose what other principals can access that data, and only the owners can ‘declassify’ information by either adding more principals to read access or by removing themselves as an owner. Principals that can read data can relabel the data, but only if that relabeling makes the data access more restrictive. Decentralized Information Flow Control is a natural model to choose when considering the enforcement of privacy and private data from an end-user point-of-view. The ability for each owner to specify the information flow policy for their data is a concept that is readily applied into an environment where each end-user may have differing privacy preferences for their data. Papagiannis et al. [84] demonstrates how information flow control can be used in software engineering as a new paradigm for enforcing security in a collaborative setting. They describe the usage of Decentralized Information Flow Control to accomplish this, building a system called DEFCon (Decentralized Event FLow Control). In their system, events consist of a set of ‘parts’ protected by a DEFCon label. 
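To make the access rules described above concrete, the following is a minimal sketch in Python of Bell-LaPadula-style checks. The names and structure are illustrative only and are not drawn from any of the cited systems: a label carries a classification level and a set of categories, one label dominates another when its level is at least as high and its categories contain the other's, and the "no read up" / "no write down" rules are expressed in terms of dominance.

LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2, "top-secret": 3}

class Label:
    def __init__(self, level, categories=()):
        self.level = LEVELS[level]
        self.categories = frozenset(categories)

    def dominates(self, other):
        # A label dominates another when its level is at least as high and its
        # category set contains the other's (e.g., 'top-secret: NASA, NATO, USAF'
        # dominates 'top-secret: NASA, USAF').
        return self.level >= other.level and self.categories >= other.categories

def can_read(subject, obj):
    # "No read up": the subject's clearance must dominate the object's label.
    return subject.dominates(obj)

def can_write(subject, obj):
    # "No write down": the object's label must dominate the subject's clearance,
    # so information never flows to a lower classification.
    return obj.dominates(subject)

analyst = Label("secret", {"NASA"})
report = Label("top-secret", {"NASA", "USAF"})
print(can_read(analyst, report))   # False - reading up is forbidden
print(can_write(analyst, report))  # True - writing up is permitted

Decentralized Information Flow Control and DEFCon generalize this single, global classification scheme by letting each owner attach and restrict labels on their own data.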
Any unit that accesses confidential data will have to abide by the labels associated with that data, restricting its ability to communicate with ineligible units that do not already have access to that data. This allows software engineers to enforce privacy in an environment which incorporates untrusted components. 36 Futoransky and Waissbein [85] implemented the ability to add meta-data tags which related to privacy scopes to PHP variables. Their system helped keep data private by allowing developers to restrict information flow based on the tags corresponding to variables, and also by allowing for end-users with an appropriate Firefox extension to see which forms on a web page correspond to what privacy scopes. In a similar and more recent development, the National Security Agency recently published an Apache Incubator project proposal, now full Apache project [86] entitled “Accumulo”, which is a database management system based on Apache Hadoop but designed after Google BigTable. Accumulo incorporates the use of cell-level labels that can be enforced in database query calls. While there are obvious applications of this to the military and environments that need such fine-grained access-control, the use of such labels for privacy specifications is an obvious and natural application of this ability. 1.6.2 Sandboxing Sandboxing was first introduced as a term by Wahbe et al. [87] in 1993. They used the term to describe a method by which they could isolate untrusted software modules to a separate memory address space, such that if a fault occurred it could be detected and kept isolated from the rest of the software. Later, Goldberg et al. [88] defined sandboxing as “the concept of confining a helper application to a restricted environment, within which it has free reign.” As a security mechanism, sandboxing allows a host system to confine untrusted code to a ‘sandbox’ where it would be unable to damage the host system even if it contained malicious code. The Virtual Faraday Cage requires a sandboxing mechanism to operate, and this section covers some of the related work specific to web applications and sandboxing. 37 Maffeis and Taly [89] provided theoretical proofs that ‘safe’ [and useful] subsets of JavaScript exist and can be used for third-party content. Currently, embedding thirdparty content within a web page can lead to all types of security issues, even if an effort is made to filter third-party content for malicious JavaScript code. The authors examined Yahoo! AD-Safe and Facebook’s FBJS as part of motivating examples as to why current methods for filtering and rewriting JavaScript code may not be sufficient to ensure security in this setting. The authors identified ‘sub-languages’ of JavaScript that had certain desirable properties, called ‘Secure JavaScript Subsets’, which had the property that code written in these subsets would be restricted from using certain JavaScript objects with privileged abilities or functionalities, or belonging to a different JavaScript application. The authors presented three examples of practical JavaScript sub-languages which restrict usage of property names outside of the code but can still be used in meaningful ways. Another project that seems to be a parallel effort to Maffeis and Taly’s research is the Google Caja [90] project. The Google Caja project takes as input JavaScript code written in a subset of the JavaScript language and rewrites it. 
The new code is then restricted to local objects only, with exception of any other objects it is explicitly granted access to at run-time. In this way, Google Caja provides security through an Object-Capability access control mechanism, allowing websites the ability to decide what capabilities third-party JavaScript code can have access to. 38 1.7 Summary Web application platforms are ubiquitous and often contain valuable and sensitive enduser data. Web application platforms also allow for third-parties to add new features and functionality to these platforms by creating extensions. Unfortunately, current practices, architectures, and methodology require that end-user data be shared with third-parties in order for third-party extensions to be capable of interacting with and processing enduser data. While this may not necessarily raise concerns for all types of web application platforms and all types of end-users, there can be situations where it would. Social networks are a specific class of web application platform that contain high amounts of personally-identifiable, specific, and valuable information. As explained in Section 1.5, this data is highly sought after by many diverse parties for many different reasons – and the impact of end-user data being made available to the wrong parties can result in consequences as extreme as job loss [72, 75] or even death [45]. As a result, social networks are a compelling type of web application platform to scrutinize with regards to end-user privacy. To begin to introduce this thesis’ contributions, this chapter has provided an overview of the underlying aspects of privacy and security necessary to discuss them. This overview is provided in Section 1.4 and 1.6 respectively. The Virtual Faraday Cage borrows from the data privacy [1] presented by Barker et. al, using much of their vocabulary to define privacy and to drive if and how sensitive end-user data can be shared with thirdparties. The basics of access control, and an introduction to information flow control were introduced in Section 1.6.1. Finally, sandboxing – or “confining [an] application to a restricted environment” – is introduced in Section 1.6.2. The Virtual Faraday Cage requires sandboxing in order to function, and it is important that sandboxing is both theoretically possible as well as practical and widely available. Fortunately, both are 39 true. The next chapter moves beyond fundamentals in privacy, security, and social networks; it describes the current research landscape and what existing work attempts to address privacy issues with regards to web application platforms and social networks. Specifically, while this chapter focused on setting the basis for what social networks are and what key challenges exist in research, the next chapter will focus on proposed solutions and approaches to addressing privacy in this area as well as highlighting the current gap in research that this thesis aims to fill. 40 Chapter 2 Related Work This chapter presents related research work and publications that address the problems of privacy in web application platforms, in addition to what was discussed in Chapter 1. 2.1 Overview The vast majority of related work addressing privacy on the web and within web applications has considered privacy policies. These works examine privacy policy agreements, negotiations, and consider the use of them as an enforcement mechanism. Better tools and mechanisms that incorporate this have also been made available to developers. 
Another approach has been to empower the end user by allowing the end-user to make a more informed decision about what information they reveal to a website based on the privacy policy for that site. While the research in this area is both valuable and informative, it is insufficient to address the problems posed by misbehaving, malfunctioning, or dishonest parties. Specific to social networks, work has been done on hiding end-user information from the social network provider, making such information available only to other end-users using a browser extension who have also been given access to the information. However, such solutions prohibit the interaction of end-users with third-party extensions in the first place. Other work has addressed social network extensions specifically, and provided a useful foundation for the Virtual Faraday Cage. Similarly, research on browser extension security can also be applied to web application extensions. 41 The previous chapter has already provided an overview of some related research in the context of introducing and motivating the Virtual Faraday Cage. In this chapter, the majority of the related work is examined in more detail. 2.2 Software and Web Applications 2.2.1 P3P and Privacy Policies The classic approach to confronting privacy problems on the web has been to attack the problem from a privacy policy standpoint, where a privacy policy is a document stating the types of data collected and the internal practices of the organization collecting the data. The majority of research that has been done in this area has dealt specifically with issues relating to Web Services, which are services designed explicitly to be interoperable with each other and typically are not ‘front-end’ applications with which end-users interact. A recurring theme in much of the research is the idea of comparing privacy-policies to each other or to customer privacy-preferences, or negotiating a new set of policies/preferences between the parties. The World Wide Web Consortium (W3C) first published the Platform for Privacy Preferences Project (P3P) in 1998 [91]. P3P was the first concerted effort from a standardization body that was also supported by key players in industry such as IBM and Microsoft. P3P is a protocol that allows websites to express their privacy policies to end-users and their web browsers, and for users to express their privacy preferences to their web browser. When an end-user using a P3P-compliant web browser connects to a P3P-compliant website, they are notified if the website’s P3P policy conflicts with their own privacy preferences. Microsoft quickly adopted P3P in 2001 [92] in Internet Explorer 6. 42 While P3P was an important step forward in bringing better awareness of privacy concerns to both end-users and businesses, it relies on the assumptions that privacy promises made by web sites are enforceable and will be kept. However, this may not always be the case: the promising website may misbehave, or even be hijacked by malicious parties, and new owners may disregard the privacy policies they previously had established. Karjoth et al. [93] identified this problem in P3P, noting that “internal privacy policies should guarantee and enforce the promises made to customers.” In other words, a P3P policy is worthless unless it accurately reflects the internal practices of the organization in the first place. 
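The policy-versus-preference comparison at the heart of P3P can be illustrated with a small sketch. The structures and vocabulary below are purely illustrative (they are not the actual P3P schema); the point is only that a user agent can mechanically flag where a site's declared collection purposes or retention exceed what the end-user is willing to accept.

# A simplified sketch of the P3P-style check a user agent performs: compare a
# site's declared policy against the end-user's preferences and flag conflicts.
# Field names and vocabulary here are illustrative, not the real P3P schema.

site_policy = {
    "email":    {"purposes": {"contact", "marketing"}, "retention": "indefinite"},
    "birthday": {"purposes": {"personalization"},      "retention": "session"},
}

user_preferences = {
    "email":    {"allowed_purposes": {"contact"},         "max_retention": "one-year"},
    "birthday": {"allowed_purposes": {"personalization"}, "max_retention": "session"},
}

RETENTION_ORDER = ["session", "one-year", "indefinite"]

def conflicts(policy, prefs):
    """Return (data item, reason) pairs where the policy exceeds the user's preferences."""
    problems = []
    for item, terms in policy.items():
        pref = prefs.get(item)
        if pref is None:
            continue  # the user expressed no preference for this item
        extra = terms["purposes"] - pref["allowed_purposes"]
        if extra:
            problems.append((item, f"unwanted purposes: {sorted(extra)}"))
        if RETENTION_ORDER.index(terms["retention"]) > RETENTION_ORDER.index(pref["max_retention"]):
            problems.append((item, "retention longer than preferred"))
    return problems

print(conflicts(site_policy, user_preferences))
# [('email', "unwanted purposes: ['marketing']"), ('email', 'retention longer than preferred')]

As Karjoth et al.'s observation above makes clear, such a check is only meaningful if the declared policy actually reflects the site's internal practices.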
In their paper, they proposed a system to translate on-going business practices that were written down in E-P3P (Platform for Enterprise Privacy Practices [94], an access control language for privacy) into a P3P policy that could be delivered to end-users. This would allow honest parties to translate their real business practices into P3P policies that they could keep. Ghazinour and Barker [95] develop a lattice-based hierarchy for purposes in P3P. Ghazinour and Barker argue for the use of lattices as the logical choice for establishing a ‘purpose hierarchy’, as there is always a more general and a more specific purpose, with the obvious extremes being ‘Any’ and ‘None’. Beyond representing purpose or even other privacy-related hierarchies, lattices should be an obvious choice for many access-control related hierarchies (e.g., groups, and so on). Rezgui et al. [96] identified and defined many aspects of privacy within Web Services, such as: ‘user privacy’, ‘data privacy’, and ‘service privacy’ – and the concept of ‘service composition’, where one Web Service is combined with another to form a ‘new’ Web Service from the perspective of an end-user. In the paper, they propose a system which is supposed to enforce privacy at what is essentially a third-party end-point, but their design does not address hostile environments and does not address ‘composite Web Services’ – which could be considered as a more general case of web applications with third-party extensions. Despite this, the definitions established within their paper are applicable in both unaddressed scenarios. They define ‘user privacy’ as a user’s privacy profile, which consists of their privacy preferences with regard to their personal information per ‘information receiver’ and per ‘information usage’ (the purpose of using that data). They define ‘service privacy’ as a comprehensive privacy policy that specifies a Web Service’s usage, storage, and disclosure policies. Finally, ‘data privacy’ is defined as the ability for data to “expose different views to different Web Services”. Another approach to dealing with privacy problems was suggested by Dekker et al. [97] in the form of a project proposal that explores the concept of using licenses to manage the dissemination and ‘policing’ of private data. In their approach, licenses are written in a formal language and can be automatically generated from privacy requirements. Sublicenses can also be derived from a parent license, and actions can be evaluated relative to a specific license. In this way, the users of private data can utilize these licenses to enforce privacy policies themselves – or alternatively the end-users can enforce their privacy through legal means. Mohamed Bagga’s [98] thesis proposes a framework for privacy-[preserving] Web Services by incorporating the Enterprise Privacy Authorization Language (EPAL) as the enforcement mechanism for privacy. Bagga also considers the problem of comparing privacy policies, and provides an algorithm for doing so. In his introduction, Bagga further elaborates on the types of use-cases for private data exchange. Bagga describes business-to-customer (B2C) and business-to-business (B2B) scenarios for private data usage. In a “data submission” B2C scenario, the business requests sensitive data from the customer. This means that the customer should verify that the privacy practices of the business are compatible with their privacy preferences before submitting their data.
In a “data disclosure” B2C scenario, a customer requests 44 sensitive data from the business. In this scenario, the business must evaluate the request against their privacy policy to determine whether or not that data will be released. In a “data request” B2B scenario, a business asks another business for sensitive data. In that scenario, the disclosing business will compare privacy policies before deciding whether or not to release the data. The last scenario described is the “data delivery” B2B scenario, where one business wants to deliver sensitive data to another business. In essence, this is simply the reverse case of the “data request” scenario, where the delivering party initiates the transaction and also must verify compatible privacy policies. As the Virtual Faraday Cage is not primarily concerned with privacy policies or privacy-aware access control, the rest of Bagga’s work is not considered further here. Xu et al. [99] follows a similar approach. In their paper, the authors examine how a composite web service could be made privacy-preserving through comparing privacy policies to an end-user’s privacy preferences. They give several examples of composite Web Services (e.g., travel agency, comparison shopping portal, and medical services) and suggest how certain aspects of a composite Web Service could be made unavailable (in an automated way) to certain customers to protect those customers’ privacy preferences. Another trend is the notion of negotiating privacy requirements between entities. For example, a customer at an online store could reveal their date of birth in exchange for special deals on their birthday or other discounts. Khalil El-Khatib [100] considers the negotiation of privacy requirements between end-users and Web Services. In this paper, the end-users either reveal additional information or expands the allowed usage of their private data in exchange for promised perks. Benbernau et al. [101] builds on the idea of negotiating privacy requirements between entities by defining a protocol where this can be accomplished in a dynamic way and potentially change the privacy agreements over time – a term they coined as ‘on-going privacy negotiations’. Daniel Le Metayer [102] advocates using software agents to represent an end-user’s privacy preferences and the 45 management of their personal data. Luo and Lee [103] propose a ‘privacy monitor’, but this requires a trusted central authority which would store all the private data of everyone else, thus simply moving the privacy problems to another entity. However, they do identify additional risks and concerns - namely, the ability for an adversary to aggregate data over a single or multiple social networks (or other websites) in an effort to breach the privacy of an individual who has tried to be careful about their data revelation. All of the work in P3P and privacy policies assumes that the promising party is honest, which can be a troublesome assumption as the only recourse for dishonest parties are lawsuits. 2.2.2 Better Developer Tools Another approach to addressing privacy concerns in software and web applications is to provide developers with better tools. To better facilitate the management of privacy policies and the protection of private data, Hamadi et al. [104] propose that the underlying Web Service protocol should be designed with privacy capabilities built into it. In their paper, they identify key aspects of privacy policies that should be ‘encoded’ into the Web Service model itself. 
To do so, they specify what aspects of privacy should be considered when designing a web service protocol (e.g., data, retention period, purpose, and third party specifications), specify a formal model for a ‘privacy aware Web Service protocol’, and they describe the tool that they created to help developers describe their Web Service protocol with privacy capabilities in mind. Another paper that was similar to Futoransky et al.’s [85] research mentioned in the previous chapter was Levy and Gutwin’s [105] work. Levy and Gutwin anchored P3P policy specifications to specific form fields on a website. This would allow for end-users (or software agents) to better understand and easily identify which P3P clauses applied 46 to which fields. 2.2.3 Empowering the End-User Another avenue of research has been to empower the end-user to make better decisions for themselves regarding information disclosure and software or service usage. Tian et al. [106] attempt to assign privacy threat values to websites, allowing end-users to decide if they are willing to risk their privacy to use a particular site. They propose a framework through which privacy threats can be assigned values by end-users and evaluated so that end-users can decide if they want to proceed with the dissemination of their information on a given web site or application. Overall, their approach is interesting, but it does not solve the underlying problem of how to gain utility from a third-party website without necessarily disclosing private information in the first place. This paper views privacy from the point-of-view of a cost/benefit approach. 2.3 Social Networks 2.3.1 Hiding End-User Data Guha et al. [107] advocate a mechanism called “none of your business” (NOYB) by which end-users do not store their real data on a social network, instead storing random plausible (but in reality, encrypted) data on the site and then using a web browser extension to decrypt/encrypt values. NOYB is a privacy-preservation mechanism that utilizes a novel approach to providing anonymity for end-users to third-parties, other end-users, and even the web application platform itself. In essence, NOYB facilitates the ability for users to create their own ’subnet’ of the service, where the real interaction would take place. This is done by utilizing encryption which substitutes real values of user-data with that of other plausible data. 47 NOYB implemented a novel idea which involved partitioning a user’s profile into ‘atoms’ such as {(First Name, Gender ), (Last Name), (Religion)}, and swapping them pseudo-randomly with other atoms from databases called “dictionaries”. These dictionaries would also be capable of growing over time as new entries are added to them. Users with the right symmetric key would be able to re-substitute back the original data and in effect decrypt a profile. However, some atoms could potentially reveal user information that otherwise would not make any sense and the steganographic storage for additional key information may defeat the stealthy approach. For example, combining atoms to reveal a profile of “Mohammed Mironova”, “Male”, and “Hindu” could indicate false information as it is unlikely that any such individual exists. Furthermore, only user attributes are protected and not communication. 
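The keyed substitution idea behind NOYB can be sketched as follows. This is a deliberate simplification under assumed names, not the construction used in the NOYB paper: the real scheme's cipher, dictionary management, and steganographic storage of key material are all omitted, and the dictionary here is a toy.

# A greatly simplified sketch of NOYB-style keyed substitution. Each 'atom' of a
# profile is replaced by another plausible entry from a shared dictionary, at an
# offset derived from a secret key; holders of the key can reverse the substitution.
# The dictionary contents and the keyed-offset construction are illustrative only.

import hmac, hashlib

FIRST_NAMES = ["Alice", "Bob", "Carol", "Dmitri", "Fatima", "Hiro", "Ingrid", "Mohammed"]

def keyed_offset(key: bytes, label: str, modulus: int) -> int:
    # Derive a deterministic pseudo-random offset for this atom from the key.
    digest = hmac.new(key, label.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % modulus

def encode_atom(value: str, dictionary: list, key: bytes, label: str) -> str:
    i = dictionary.index(value)
    shift = keyed_offset(key, label, len(dictionary))
    return dictionary[(i + shift) % len(dictionary)]

def decode_atom(value: str, dictionary: list, key: bytes, label: str) -> str:
    i = dictionary.index(value)
    shift = keyed_offset(key, label, len(dictionary))
    return dictionary[(i - shift) % len(dictionary)]

key = b"shared-secret-between-friends"
stored = encode_atom("Alice", FIRST_NAMES, key, "first_name")  # plausible but false name
print(stored)
print(decode_atom(stored, FIRST_NAMES, key, "first_name"))     # "Alice"

As noted above, recombining independently substituted atoms can yield implausible profiles; the sketch glosses over those issues entirely.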
One of the limitations of NOYB is that it does not account for how communication between end-users can be achieved, or how legitimate end-user data can be shared with third-parties (or even the application provider) for legitimate and useful features and capabilities. Additional limitations may include the ability to detect steganographic hidden data, as well as the challenges associated with using NOYB to send messages or wall posts in an encrypted way. Furthermore, key-revocation and data updates with a new key are not explained in detail in the paper, and re-encrypting (and changing) stuff such as a user’s gender and name – while possible in the real world – should certainly defeat the purpose of being stealthy, in particular if this happens more than once. Additionally, while NOYB and similar projects may be able to bypass detection by web application providers, the use of NOYB may still constitute a violation of the platform’s terms of service – as NOYB essentially piggybacks on Facebook to provide its own social network experience. 48 Similar to NOYB, Luo et al. [13] propose FaceCloak, which seems to build on and improve NOYB’s idea. Here, the web application platform has no access to encrypted information as such information is completely stored on a third-party website. Consequently, the concerns about detection that the previous related work identified are no longer applicable to FaceCloak. However, the benefits and other criticisms remain the same. Lucas and Borisov [108] propose flyByNight, where they incorporated a web application extension that could encrypt/decrypt messages using a public key cryptosystem. flyByNight is an interesting exercise in creating a public-key message encryption/decryption scheme within Facebook. Unfortunately, as the authors admit, their system is not applicable in a hostile environment. This is to say, if Facebook really wanted to decrypt the messages that they were sending through their third-party application, then Facebook could easily eavesdrop and do so. Alternatively, if Facebook were compromised, this could also happen. The authors contend that some mitigation may be found by establishing a legal case against Facebook for decrypting messages; however this approach avoids this key challenge by making it someone else’s responsibility. However, if one does not trust the web application provider to begin with, why should one trust the web application provider not to tamper with flyByNight? Baden et al. [109] propose Persona as an approach to protecting privacy in social networks. In this paper, the authors propose using an Attribute-Based Encryption (ABE) approach to protecting end-user data in a social network. In their introduction, they correctly point out some of the issues with social networks (such as Facebook), in particular the legal agreement which has statements such as, ”[users] grant [Facebook] a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any [Intellectual Property] that [users] post on or in connection with Facebook.” Persona uses a distributed data-storage model, and requires end-users have browser extensions that can 49 encrypt and decrypt page contents. 2.3.2 Third-Party Extensions Felt and Evan’s [65] work is the most directly related work to the Virtual Faraday Cage. They correctly identify that one of the biggest threats to end-user privacy within web application platforms – in particular social platforms like Facebook and the OpenSocial API – are third-party extensions. 
Felt and Evans discover that the vast majority of Facebook extensions (over 90%) have access to private data that the extensions do not even claim to need. They also identify the types of applications by category, without explicitly labeling any as ‘junk’ or ‘spam’ applications. Despite this, essentially 107 of the 150 applications examined do not provide explicitly new or novel features to Facebook, and the authors are right to conclude that they do not and should not have access to private data. They go further, stating that extensions should not be given access to private data at all, and that private data can be isolated by the use of placeholders, essentially achieving a ‘privacy-by-proxy’. For example, “<uval id=’’[$self]’’ field=’’name’’/>” could be a substitute for displaying the end-user’s name: the HTML code would be replaced by “John Doe” or whatever the name field should be. The authors concede that while this method does not suffice for more sophisticated extensions, it is sufficient for the ‘majority’ of extensions. The authors state that if a more sophisticated extension needs end-user data, it should simply prompt the user – in effect, forcing the end-user to fill out another profile on a third-party website. In a separate work, Felt et al. [69] address the underlying mechanics and problems of embedding third-party content (in particular, extensions) into a web application platform. They point out that existing Same-Origin-Policies [110] for code execution are insufficient for practical use because they are too restrictive: the parent page has no access to an inline frame’s content. On the other hand, directly embedding a third-party’s scripts presents a significant risk for exploits, as non-standardized code-rewriting can lead to costly oversights. In the paper, the authors propose that browser vendors (and, indirectly, standards bodies like the W3C) provide a new standard for isolating untrusted code from the rest of the Document Object Model (DOM). This isolation would allow one-way interaction from a parent to a child. Recent developments seem to have taken into account the authors’ concerns. The Google Caja [90] project rewrites untrusted JavaScript code and prohibits it from accessing the DOM or any other objects without explicit capabilities granted to it from the parent page. Essentially, this allows Caja to act as a sandboxing mechanism. 2.4 Browser Extensions Barth et al.’s [111] work primarily addresses browser extensions, but many of the lessons can be extrapolated to apply to web application platform extensions as well. The authors examined 25 Firefox extensions and concluded that 88% of them needed fewer privileges than they were granted, and that 76% of them had access to APIs with more capabilities than they needed – compounding the difficulty of reducing the extensions’ privileges. Instead of focusing on the potential for malicious third-party extensions to exist, the authors focus on the potential for other malicious parties to exploit bugs or flaws in third-party extension design – and leverage the APIs and capabilities to which those third-party extensions have access. Their research concludes that Firefox should build a new extension platform addressing these issues and concerns. Felt et al. [112] extend their previous work [111] by expanding their study of third-party ‘extensions’ to the Google Chrome browser and the Android operating system.
Here, the authors focus on three benefits of install-time permissions: limited extension privileges (by conforming to the ‘principle of least privilege’), user consent, and benefits to the process of reviewing extensions. For example, the social aspect of reviews and end-user feedback can minimize the usage of malicious extensions. Furthermore, the listing of install-time capabilities also allows security researchers to concentrate on extensions with more dangerous capabilities. In the context of Firefox and Chrome, this allows (and would allow for) official reviewers to speed up the process of examining extensions before they are published to official repositories. To facilitate this, the authors propose a permission ‘danger’ hierarchy, where more ‘dangerous’ permissions are emphasized over less ‘dangerous’ ones. Fredrikson and Livshits [113] argue that the browser should be the data-miner of personal information, and that it should be in control of the release of such information to websites and third-parties. Furthermore, it can support third-party miners whose source code is statically verified to be unable to leak information back to those third-parties. 2.5 Summary There are many approaches to addressing the issue of privacy in web applications. One of the common approaches to addressing privacy is through the use of privacy policies and technologies that support them, such as the Platform for Privacy Preferences Project (P3P) [91]. In this area, work has been done to allow businesses to express their internal workflows as P3P policies [94] and to better express purposes through the use of a lattice hierarchy [95]. Another approach to addressing privacy through the use of privacy policies was to consider licenses as a way to manage the dissemination of end-user data [97]. Beyond protocols, other research examined the possibility of providing better tools for developers to create privacy-aware web services, for instance, adding P3P privacy specifications automatically to form fields [105]. Finally, providing a tool for end-users to utilize to better gauge the risk to their privacy associated with a specific website was also proposed by Tian et al. [106]. This chapter also surveyed several works [13, 107, 108, 109] specific to social networks that examine the potential of hiding end-user data from the social network provider itself. Two other works [65, 69], also social network specific, deal with third-party extensions to these social networks. In one [65], the authors propose a “privacy-by-proxy” mechanism for protecting end-user data from third-parties. Their method restricts third-parties from obtaining any end-user data from the social network platform directly, and they propose that extensions requiring more detailed information should obtain that information external to the social network. In the other [69], the authors propose that new standards for isolating untrusted JavaScript code should be implemented by browsers. To a similar end, recent projects such as Google Caja [90] seem to have taken their concerns into account. Research related to browser extensions [111, 112, 113] was also examined for any potential applications to web application platforms. In two of these works [111, 112], the authors argued that install-time permissions for extensions allowed end-users to make better and more educated decisions about whether to install a browser extension or not.
In the other paper [113], the authors argue for a "danger hierarchy" for permissions, which would allow end-users to clearly see which permissions are more risky than others. While the on-going research concerning privacy policies is important and fundamental to data privacy, such works assume or require that the promising party is honest. Thus, using privacy policies as the sole enforcement mechanism becomes troublesome when the promising party cannot be guaranteed to behave honestly. Furthermore, few of the works examined addressed the problem of third-parties and end-user privacy, instead focusing directly on privacy issues between end-users and the web application platform they are using. Of the works that did, the authors proposed a "privacy-by-proxy" mechanism [69] and, ultimately, better sandboxing mechanisms [65] for embedded third-party code.

Consequently, there exists a gap in current research: can third-party extensions to web application platforms work with end-user data in a way that prohibits end-user privacy from being violated? This is the gap that this thesis aims to address and fill. The next chapter introduces the theoretical model for this thesis. The theoretical model will define the vocabulary used by this thesis to present its contribution, the Virtual Faraday Cage. Additionally, observations and propositions will be provided that make privacy guarantees for systems that comply with the model.

Chapter 3
Theoretical Model

This chapter presents the theoretical model used by the Virtual Faraday Cage, starting with definitions and then continuing on to describe abstract objects and operations that can be used to define privacy violations and protect against them.

3.1 Basics

A web application is a service accessible over the internet through a number of interfaces. These interfaces can be implemented through web browsers or custom software, and may run on different ports or use different protocols. A web application platform (hereafter referred to as a platform) is a web application which provides an API so other developers can write extensions for the platform to provide new or alternative functionality. These developers are referred to as third-parties. Traditionally, extensions to web application platforms are web applications themselves: they are hosted by third-parties and can be accessed by the web application platform (and vice versa) through API interfaces, and often by end-users through a web browser or other software. End-users are the users of the web application platform whose privacy the Virtual Faraday Cage aims to protect. Figure 3.1 shows this model.

End-user data considered sensitive by the platform represents all data for which the end-user can specify access control policies with respect to other principals, such as third-party extensions. Private data represents a subset of sensitive data which is strictly prohibited from being accessed by any remote extension component.

Figure 3.1: A web application platform.

3.2 Formal Model

The formal model for the Virtual Faraday Cage is presented here. These definitions and observations allow the Virtual Faraday Cage to establish privacy guarantees.

3.2.1 Foundations

Definition 3.1. Data
Let D be the set of all data contained within a web application platform. Let Du ⊆ D be the subset of D that contains all data pertaining to an end-user u, as specified by a particular platform.
The Virtual Faraday Cage defines data to be representable in vector form; thus ∀di ∈ Du, di = ⟨x0, ..., xn−1⟩ where n > 0. Furthermore, each xi is considered to be either a string, a number, or some other atomic type within the context of the web application platform. The special data value NULL can be represented as a 0-dimensional vector ⟨⟩.

By convention, data is represented in "monospaced font" with quotes (except for NULL and numbers), and classes of data or "attributes" are capitalized and without quotes, for example: Age, Gender, Date, Occupation.

Definition 3.2. Sensitive data
Let Su ⊆ Du be a subset of all data pertaining to u that is considered "sensitive". Sensitive data represents all data pertaining to an end-user that the end-user can specify access control policies for.

Depending on the specific web application platform, sensitive data may consist of different types of end-user data. A web application platform that requires that all users have a publicly viewable profile picture, for instance, would force that data, by definition, to not fall into the category of sensitive data, because end-users have no control over the access control policies for that profile picture. Consequently, data can only be classified as sensitive if the end-user has control over whether or not that data is disseminated to other users or third-parties.

Traditionally, determining whether or not data is "sensitive" has been left to the end-user to decide. This thesis departs from this trend, instead leaving that distinction to the web application platform to decide. This is done for definitional purposes rather than philosophical ones: an end-user may still have reservations about being forced to have a public profile picture, but if the end-user has no control over how that data is shared with other end-users or third-parties, then it is not considered sensitive data with respect to this model. The next two definitions will introduce third-parties and extensions.

Definition 3.3. Third-parties
A third-party θ represents an entity in the external realm. Third-parties can control and interact with extensions belonging to them, and they can collude or otherwise share data with any other third-parties. All data visible to an extension's remote component (Definition 3.4) is visible to the third-party owning it as well. Additionally, no assumptions are made regarding the computational resources of a third-party. The set of all third-parties is T.

The definition of third-parties also departs from traditional definitions. In Barker et al. [1], a third-party is any party that accesses and makes use of end-user ("data-provider") data. A house, in their terminology, is a neutral repository for storing that data. Consequently, a web application platform that both stores and utilizes end-user data has roles both as a house as well as a third-party. In this model, however, web application platforms (or "service-providers") are not defined as third-parties. Instead, this definition is strictly reserved for other parties that obtain end-user data from the web application platform. In this way, third-parties are similar to the concept of an adversary from other security literature. In the Virtual Faraday Cage's threat model, it is third-parties that the VFC aims to defend against. Next, extensions must be defined:

Definition 3.4. Extensions
An extension, denoted by E, is a program that is designed to extend the functionality of a given web application platform by using that platform's API.
All extensions must be comprised of either a remote extension component (denoted as E), a local extension component (denoted as E′), or both. The set of all extensions is denoted by E.

The ownership relation o : e ∈ {E, E, E′} → θ ∈ T is a many-to-one mapping between extensions and/or extension components and the third-parties that "own" them. Furthermore, every extension E is "owned" by some third-party θ. This is expressed as ∀e ∈ {E, E, E′}, ∃θ ∈ T such that o(e) = θ. Thus, o represents the ownership relation.

Here, extensions are envisioned as other web applications that interact with the web application platform. The splitting of extensions into local and remote components is necessary for the operation of the Virtual Faraday Cage. Local extension components run within a sandboxed environment controlled by the web application platform, while remote extension components run on third-party web servers. The distinction between local and remote components is covered in detail in the next chapter. Now that data, third-parties, and third-party extensions have been defined, projections and transformations will be introduced. These allow for data to be modified before revelation to a third-party, facilitating fine-grained control over end-user data.

Definition 3.5. Projections and transformations
Suppose that a certain class of data has a fixed dimension n and is of the form ⟨x0, ..., xn−1⟩. Fix ⟨y0, ..., ym−1⟩ as a vector where 0 ≤ yi ≤ n − 1. Then a projection Py0,...,ym−1 : ⟨x0, ..., xn−1⟩ → ⟨xy0, ..., xym−1⟩ is a function that maps a data vector ⟨x0, ..., xn−1⟩ to a new vector that is the result of the original vector "projected" onto dimensions y0, ..., ym−1, in that order. A transform τ is an arbitrary mapping of a vector ⟨x0, ..., xn⟩ to another vector ⟨y0, ..., ym⟩. The transform may output a vector of a different dimension, and may or may not commute with other transforms or projections. Consequently, every projection can be considered a type of transform, but not every type of transform is a projection.

Together, projections and transforms can be used as a method through which data can be made more general and less specific. This allows the Virtual Faraday Cage to give end-users the ability to control the granularity and specificity of information they choose to release to third-parties. Projections essentially allow the selective display (typically, reduction) of different dimensions of a particular data item. For example, P0,3,2(⟨x0, ..., xn⟩) = ⟨x0, x3, x2⟩ and P0,0,0,0(⟨x0, ..., xn⟩) = ⟨x0, x0, x0, x0⟩. If si = ⟨"December", "2nd", 1980⟩ represents a sensitive date (e.g., birth-date) held within a particular platform, then P0,2(si) = ⟨"December", 1980⟩ represents a projection of that date onto the month and year dimensions. On the other hand, transforms are arbitrary functions that operate on the vector space of data. Suppose a transform τ could map a date to another date where the month has been replaced by the season. Thus, in this example, P0,2(τ(si)) = τ(P0,2(si)) = ⟨"Winter", 1980⟩; this composition is also called a view.

Definition 3.6. Views
A view is a particular composition of one or more projections and/or transforms, and is a mapping from a vector ⟨x0, ..., xn⟩ to another vector ⟨y0, ..., ym⟩. Depending on the type of data, a context-specific generalization hierarchy, or concept hierarchy, can exist for views when applied to data. A domain generalization hierarchy, as defined by Sweeney et al.
[114], is a set of functions that impose a linear ordering on a set of generalizations of an attribute within a table (e.g., "Postal Codes", "Age", etc.), where the minimal element is the original range of attribute values, and the maximal element is a completely suppressed element equivalent to a NULL value. This allows for selective revelation of end-user data (e.g., the value "Adult" instead of "25"), giving the end-user more control over their personal information and keeping data more private. The Virtual Faraday Cage makes use of this concept when defining granularity and generalization; see Section 4.10.4 for examples.

Definition 3.7. Granularity
Granularity refers to how specific the view is of a particular data item. For each particular data item, a different generalization hierarchy may exist. A generalization hierarchy is a lattice with the most specific data value at one end, and the least specific data value (e.g., NULL) at the other. Each directed edge of the lattice graph represents a level of generalization from the previous data value. Depending on the generalization hierarchy, one view v′ of data may be a derivative of another view v: if a path exists from v to v′, then v′ is a derivative (or generalization) of v. Otherwise, v′ is not a generalization of v. If v′ is considered to be a derivative of v, this can be represented as v′ ← v. For definitional purposes, this relationship is considered reflexive: ∀v, v a view, v ← v.

As an example, the view v(s) = s is specific and exact, whereas v(s) = NULL is not. Figure 3.2 shows an example generalization hierarchy for a Date data item. In the example, ⟨1974⟩ is a generalization of ⟨"December", 1974⟩, but it is not a generalization of ⟨"December", 14⟩.

This thesis now introduces the concept of a "principal" – an entity which can interact with (potentially reading or writing to) sensitive end-user data, constrained by the end-user's access policies. The following definition specifies what a principal is within this model.

Figure 3.2: An example generalization hierarchy for the data item s = ⟨"December", 14, 1974⟩.

Definition 3.8. Principals
A principal is an entity that can potentially access sensitive end-user data, and for which the end-user can create data access policies for their data. The set of all principals is denoted as P.

Principals within the Virtual Faraday Cage include extension components (E and E′), and potentially other end-users or other objects or "actors", depending on the particular web application platform. The next definitions will introduce the Virtual Faraday Cage's access control model: reading data will be handled by the associated privacy policies for that data, and writing data will be handled by the associated write-policies specific to that data.

Definition 3.9. Privacy policies
Let a privacy-policy be defined as a tuple of the form ⟨type, view⟩ associated with a particular data item si ∈ Su. Let type represent a value in {single-access, request-on-demand, always}, and let view represent a particular view of the data (a sequence of projections and transforms). Let pA(si) be defined as the mapping between a given data-item si ∈ Su and the privacy-policies associated between it and a principal A ∈ P. Then pA(si) represents the privacy-policies of a given data-item relative to a given principal.
As there can be more than one type of privacy policy associated with a given data-item (single-access, request-on-demand, or always), pA(si) will return a set of tuples of the form {⟨type, view⟩, ...}. Because there are only three possible access types, this ensures 0 ≤ |pA(si)| ≤ 3. In the Virtual Faraday Cage's model, a principal cannot read a data item, or even learn of its existence, unless a privacy policy for that data item is specified permitting this, and then the principal can only obtain the view of the data item as specified by the privacy policy. The single-access policy would allow an accessing party one-time access to the data item, which can be enforced by the web application platform checking a list of past accesses to the data item. The request-on-demand policy would require that the end-user authorize each access request for the data in question. Finally, the always policy always allows access to the data.

For example, let si be of the form ⟨Country, Province, City⟩, representing some geo-locational data. Suppose that an extension, when first activated or used for the first time, should know the country, and province or state, that the user is located in at that moment – but otherwise be restricted to only seeing the current country. A privacy-policy reflecting that would be one where: pA(si) = {⟨single-access, P0,1(si)⟩, ⟨always, P0(si)⟩}.

Definition 3.10. Write-policy
Let wA(si) be a function that returns the write-access ability for a particular si ∈ Su for a given principal A. In particular:

wA(si) = −1 if A cannot write/alter si,
wA(si) = 0 if A can write/alter si with explicit end-user permission,
wA(si) = 1 if A can always write/alter si.

Furthermore, unless an end-user has explicitly set the write-policy for a data-item si to be a specific value, wA(si) = −1.

In order to facilitate "social" extensions, or extensions that can work between multiple end-users by sharing some information, local extension cache spaces are also introduced:

Definition 3.11. Local extension cache space
The local extension cache space is a data item that is controlled by the web application platform, and is denoted as CE,u for a given extension E and user u. The local extension cache space serves as an area for a local extension component to store data and information required for the extension to operate. While the web application platform can always choose to arbitrarily deny any read and write operations, by default the following hold: wE(CE,u) = 1, wE′(CE,u) = 1, pE(CE,u) = ∅, and pE′(CE,u) = {⟨always, CE,u⟩}.

These default settings allow both extension components to write to the shared extension cache, but prohibit the remote extension component from being able to read from it. The local extension component can always read from the cache with full visibility. Furthermore, provisions are made so that the web application platform can always choose to deny a read or write request. This allows the model to take into consideration situations that may arise when the extension uses up all its allocated space in its local extension cache, or if the web application platform chooses to not grant any local storage space to extensions. Consequently, the local extension cache space serves as a strictly inbound information flow area, allowing the remote component of a third-party extension to write to it but prohibiting it from reading from it.
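To make Definitions 3.5–3.10 concrete, the following is a minimal Python sketch of data items as tuples, views as composable functions, and privacy- and write-policies keyed by principal and data URI. All names here (projection, privacy_policies, write_policies, and the example URIs) are illustrative assumptions, not part of the thesis's proof-of-concept.

```python
from typing import Callable, Tuple

Data = Tuple                      # a data item is a vector, e.g. ("December", "2nd", 1980)
View = Callable[[Data], Data]     # a view is a composition of projections and transforms

def projection(*dims: int) -> View:
    """P_{y0,...,ym-1}: keep the listed dimensions, in the listed order."""
    return lambda d: tuple(d[i] for i in dims)

def month_to_season(d: Data) -> Data:
    """An example transform: replace the month with its season."""
    seasons = {"December": "Winter", "January": "Winter", "June": "Summer"}
    return (seasons.get(d[0], d[0]),) + tuple(d[1:])

# Privacy-policies (Definition 3.9): for a (principal, data URI) pair, a set of
# (type, view) tuples with type in {single-access, request-on-demand, always}.
privacy_policies = {
    ("extension-A", "data://johndoe/profile/birth-date"): {
        ("single-access", lambda d: projection(0, 2)(month_to_season(d))),
        ("always", projection(2)),   # year only, at any time
    },
}

# Write-policies (Definition 3.10): -1 never, 0 with end-user consent, 1 always.
write_policies = {("extension-A", "data://johndoe/profile/birth-date"): -1}

def p(principal: str, uri: str) -> set:
    return privacy_policies.get((principal, uri), set())

def w(principal: str, uri: str) -> int:
    return write_policies.get((principal, uri), -1)   # default: not writable

birth_date = ("December", "2nd", 1980)
for policy_type, view in p("extension-A", "data://johndoe/profile/birth-date"):
    print(policy_type, view(birth_date))   # ('Winter', 1980) once, (1980,) always
print("writable:", w("extension-A", "data://johndoe/profile/birth-date"))
```

Under these illustrative policies, the principal could obtain ⟨"Winter", 1980⟩ once and ⟨1980⟩ at any time, but never the full birth-date, and it could never write to it.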
Additionally, as there is no requirement for unique and separate data storage, a third-party could, for example, cache the same data globally across all instances of their extensions. Finally, before the threat of information leakage can be introduced and countermeasures discussed, the sets of all visible and writable data must be introduced:

Definition 3.12. The set of all visible data
Let VA,u ⊆ Su represent the set of all visible data (to some extent) for a particular end-user u and to a particular principal A. Then, VA,u = {s | s ∈ Su, pA(s) ≠ ∅}.

Definition 3.13. The set of all writable data
Let WA,u ⊆ Su represent the set of all writable or alterable data for a particular end-user u and to a particular principal A. Then, WA,u = {s | s ∈ Su, wA(s) ≥ 0}.

Note that W is not necessarily a subset of V (for any given A or u), and depending on the particulars, this may allow for situations where a Bell-LaPadula [82] or similar model for access control could be followed.

3.2.2 Information leakage

Enforcing read and write access through privacy-policies and write-policies as defined in the previous definitions is sufficient for a "naïve" implementation of the Virtual Faraday Cage: principals can only read or write data if an explicit policy exists allowing them to do so, or if explicit end-user permission is obtained in the process. However, what if an end-user grants permissions such that there is an overlap of readable and writable data between a principal outside of a sandbox (e.g., E) and a principal that is within it (e.g., E′)? As it turns out, if a strict separation of visibility and writability is not enforced between remote and local extension components, it is possible to reveal data that privacy policies do not permit, thus constituting information leakage and a privacy violation.

Observation 3.1. Simply enforcing privacy and write-access policies is insufficient for preventing information leakage.

Explanation: It is sufficient to demonstrate that it is possible for an extension to exist such that an end-user prohibits a third-party's access to some confidential data, but for that third-party to gain access to it through the use of a local extension component. Suppose an end-user u installs an extension E which has remote (E) and local (E′) components. Since the end-user has full control over how they choose to share or limit their sensitive information revelation to third-parties, the end-user prohibits E from accessing some data s1, but allows it to access some data s2. The end-user can also allow E′ to access s1 and write to s2. At this point, pE(s1) = ∅, meaning that E should be unable to view s1, and consequently the third-party would not be privy to that information either. However, |pE′(s1)| > 0, meaning that E′ can read s1. Because it can also write to s2, it can now leak some information about s1 to E by encoding it and writing it to s2, consequently leaking information back to the third-party. While it may be tempting to argue that the example above is simply a case of poor policy choices, the fact that it can occur despite the lack of an explicit privacy-policy allowing for it indicates that the system as defined is insufficient for preventing information leakage.
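The covert channel in Observation 3.1 is easy to exhibit in code. The sketch below uses an assumed toy data store and an arbitrary base64 encoding; it shows a local component E′, permitted to read s1 and write s2, smuggling s1 out to the remote component E, which is only permitted to read s2.

```python
import base64

# A toy store of sensitive end-user data: s1 is hidden from the remote
# component E, while s2 is readable by E and writable by the local component E'.
store = {"s1": "religious affiliation: ...", "s2": "favourite colour: blue"}

def local_component_runs() -> None:
    """E' reads s1 (permitted) and smuggles it into s2 (also permitted)."""
    secret = store["s1"]
    store["s2"] = base64.b64encode(secret.encode()).decode()   # covert encoding

def remote_component_runs() -> str:
    """E reads s2 (permitted) and decodes the smuggled value of s1."""
    return base64.b64decode(store["s2"]).decode()

local_component_runs()
print(remote_component_runs())   # recovers s1 even though p_E(s1) = ∅
```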
Aside from the trivial case where E′ has the same access restrictions as E (and thus, access to the exact same data and capabilities), if E′ is granted any additional read capabilities then it must be ensured that E′ can never leak this data back to E, and thus back to the third-party. The next observation examines the requirements to prevent this.

Observation 3.2. In order for a single extension to be prohibited from leaking information back to the third-party, its local component must be prohibited from writing to any data that its remote component can read.

Explanation: If ∃s ∈ Su such that pE(s) = ∅ and |pE′(s)| > 0 (the non-trivial case), then there exists a possibility for information to leak from E′ to E, as demonstrated in Observation 3.1. To prevent this, E′ must be prohibited from writing to any data that E can read. More formally, the following should hold: ∀s ∈ Su, |pE(s)| > 0 ⟹ wE′(s) = −1. Another way of representing this is to think of it as prohibiting communication overlaps between E and E′. Specifically, VE,u ∩ WE′,u = ∅ must hold. Thus, any data that is visible to a remote component cannot be written to by a local component, preventing information leakage from local components. By ensuring this, for any given extension E, E′ would be unable to leak information back to the third-party.

However, while necessary, this condition alone is insufficient for preventing all possible ways for information leakage to occur. Colluding third-parties and their multiple extensions with differing and overlapping Vs and Ws for a given end-user u could bypass this requirement, as it only requires that this hold for a single extension at a given point in time. This could be fixed by keeping track of which s has ever been written to by a given extension Ei's local component and then denying a read-request from any other extension Ej's remote component (where i ≠ j), but this approach is analogous to moving s to a subset of Su that is unreadable for remote extension components. Because of this, the Virtual Faraday Cage defines a subset of sensitive data that is specifically reserved for local extension components to write to, and explicitly prohibits remote components from reading such data. Similarly, local extension components are restricted from writing to data outside of this subset.

Definition 3.14. Private data
Let Xu represent the set of private data belonging to an end-user u, where Xu ⊆ Su. Then ∀s ∈ Xu and ∀E, pE(s) = ∅. In other words, no remote extension component can access any data in Xu.

This thesis makes a distinction between private and sensitive data – for architectural as well as functional reasons: certain types of information you may be willing to reveal to certain parties under certain circumstances, but other types of information you may choose to keep private from all parties. For instance, your name and gender may be considered sensitive information, and you may be willing to reveal one or both to third-parties in certain circumstances – but your sexual orientation or religious affiliation may be something you consider more private.

Definition 3.15. Revised Local Extension Components
In Definition 3.4, extensions were defined and local extension components were introduced. Now a new constraint is introduced for all local extension components: no local extension component can write to data outside of a user's private data set. This is represented by: ∀si ∈ Su \ Xu, ∀E′, wE′(si) = −1.
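The conditions of Observation 3.2 and Definitions 3.14–3.15 reduce to a few disjointness checks that a platform could run over its policy tables before activating an extension. The helper names and table layout below are assumptions for illustration, not the thesis's implementation.

```python
def visible(policies: dict, principal: str) -> set:
    """V_{A,u}: URIs for which at least one privacy-policy exists for the principal."""
    return {uri for (p, uri), pols in policies.items() if p == principal and pols}

def writable(write_policies: dict, principal: str) -> set:
    """W_{A,u}: URIs whose write-policy value for the principal is 0 or 1."""
    return {uri for (p, uri), v in write_policies.items() if p == principal and v >= 0}

def leak_free(policies: dict, write_policies: dict,
              remote: str, local: str, private_uris: set) -> bool:
    """Check Observation 3.2 and Definitions 3.14-3.15 for one extension (E, E')."""
    no_overlap = visible(policies, remote).isdisjoint(writable(write_policies, local))
    local_writes_private_only = writable(write_policies, local) <= private_uris
    remote_reads_no_private = visible(policies, remote).isdisjoint(private_uris)
    return no_overlap and local_writes_private_only and remote_reads_no_private
```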
Defining private data that is accessible only to local extension components allows for a strict separation of data; a locally running extension component E′ can never taint data visible to some remote extension component E, as E′ is strictly restricted to writing only to data in Xu. Furthermore, if data in Xu can never be removed from Xu, then it will never be possible for any third-party to learn anything about data within Xu. This is because Definitions 3.14 and 3.15 prevent that data from being revealed to remote extension components, which are the only components of an extension capable of communicating back to third-parties. By default, private data can never be declassified. Despite this, the Virtual Faraday Cage model does not explicitly prohibit an end-user from declassifying private data, and this does not break any of the previous observations so long as the assumption is that the end-user is the only entity that can know whether data should be kept private or not.

Definition 3.16. Visible and writable private data
Let V′A,u ⊆ Xu represent the set of all visible private data (to some extent) for a particular end-user u and to a particular principal A. Then, V′A,u = {s | s ∈ Xu, pA(s) ≠ ∅}. In other words, V′A,u is the subset of private data for which there exists a privacy-policy relative to A. Let W′A,u ⊆ Xu represent the set of all writable or alterable private data for a particular end-user u and to a particular principal A. Then, W′A,u = {s | s ∈ Xu, wA(s) ≥ 0}. In other words, W′A,u is the subset of private data for which there exists a write-policy that may allow A to write to that data.

Definition 3.17. Basic privacy violations
A basic privacy violation occurs when a third-party θ obtains data that it should not have access to. More formally: given {Ea1, ..., Eak} where o(Eai) = θ for all 1 ≤ i ≤ k, and θ obtains any knowledge about s ∈ Su: if ∄Eg ∈ {Ea1, ..., Eak} such that pEg(s) ≠ ∅, then a basic privacy violation has occurred.

This definition is sufficient to cover obvious examples of privacy violations: if an end-user u did not grant authorization to view some data si ∈ Su to any extension owned by a third-party θ, but that third-party somehow got access to it – then a basic privacy violation has occurred. However, the definition of a basic privacy violation could be considered "weak" in that it may be possible for a third-party to obtain full knowledge about s even though it is only authorized for partial knowledge. As a consequence, generalized privacy violations are also defined.

Definition 3.18. Generalized privacy violations
A generalized privacy violation occurs when a third-party θ obtains a view of data that it does not have access to, or otherwise is unable to derive from the data it has access to. More formally: given {Ea1, ..., Eak} where o(Eai) = θ for all 1 ≤ i ≤ k, and θ obtains a view v′ of s ∈ Su: if ∄Eg ∈ {Ea1, ..., Eak} such that ∃⟨access-type, v⟩ ∈ pEg(s) where v′ ← v, then a generalized privacy violation has occurred.

Note that transforms are defined as deterministic mappings for data, and that the definition of generalized privacy violations only covers deterministic transforms. This means that if non-deterministic transforms are used, it may be possible for a privacy violation to go undetected. Furthermore, the verb "obtains" refers explicitly to obtaining information from the web application platform. If a third-party θ guesses the correct value of data, it is not a violation of privacy by Definitions 3.17 and 3.18.
However, if non-deterministic transforms were allowed, this would require altering the definition of privacy violations so that it would encompass the potential for a third-party θ to obtain an unauthorized view of data through the platform with a probability better than what is granted through the end-user's privacy policies.

Observation 3.3. Non-private data (s ∈ Su, but s ∉ Xu) cannot be protected against privacy violations.

Explanation: Let θ1 and θ2 be two third-parties with their own sets of extensions, E1 = {Ea1, ..., Eak} and E2 = {Eb1, ..., Ebk}, where all extensions in E1 are owned by θ1 and all extensions in E2 are owned by θ2. Let s ∈ Su be data that is not shared with any extension in E1 (that is, pEai(s) = ∅ for all Eai ∈ E1), but that is shared with some extension in E2. Thus, θ2 knows the value of s, and by definition, can share this value with θ1. This type of attack is called a collusion attack, and serves to emphasize that for security purposes one third-party is indistinguishable from another. Consequently, any data that is not explicitly kept private from all third-parties and their remote extension components can be shared with other third-parties, violating end-user privacy policies and preferences.

Because Observation 3.3 has demonstrated that non-private data can never be fully protected from privacy violations, it is important to define a more specific type of privacy violation that relates only to private data:

Definition 3.19. Critical privacy violations
A critical privacy violation is a privacy violation that occurs when the data item in question is also a member of the private data set for that user. Given {Ea1, ..., Eak} where o(Eai) = θ for all 1 ≤ i ≤ k: if θ obtains knowledge about s ∈ Xu, then a critical privacy violation has occurred.

Proposition 3.1. Abiding by privacy-policies and write-policies prevents critical privacy violations.
Proof: Suppose a third-party θ has obtained information about s ∈ Xu. Because s can only be read by local extension components, this implies that at least one local extension component has accessed s. However, these local extension components can only write to other data in Xu, or to the local extension cache space, which is not readable by remote extension components. Thus, either a remote extension component can read s ∈ Xu, or a remote extension component can read the local extension cache space. This is a contradiction of the specifications and definitions of both private data (Definition 3.14) as well as local extension cache spaces (Definition 3.11). Consequently, critical privacy violations are not possible within the Virtual Faraday Cage. Note that enabling non-deterministic transforms has no effect on the prevention of critical privacy violations.

3.3 Summary

This chapter presented the theoretical model used by the Virtual Faraday Cage. In Section 3.1 the vocabulary used by the model and the Virtual Faraday Cage is introduced and defined, and in Section 3.2 the formal model is introduced. Besides introducing the vocabulary used in the remainder of this thesis, this chapter also provided a fundamental security guarantee in Proposition 3.1. Specifically, by marking a subset of end-user data as private, and by abiding by the established rules such as privacy-policies and write-policies, one can prevent critical privacy violations within a given system. This result is the basis for the Virtual Faraday Cage's privacy guarantees, and is built on in the next chapter.
The next chapter introduces the Virtual Faraday Cage, the main contribution of this thesis.

Chapter 4
Architecture

This chapter presents the Virtual Faraday Cage, a new architecture for extending web application platforms with third-party extensions. It presents an overview of the architecture and its features, as well as high-level and low-level protocol information. A proof-of-concept implementation of the Virtual Faraday Cage is also described in this chapter.

4.1 Preamble

In electrical engineering, a Faraday Cage is a structure that inhibits radio communication and other forms of electromagnetic transmissions between devices within the cage and devices outside. Consequently, a Faraday Cage can be thought of as restricting information flow between the interior and exterior of the cage. Faraday Cages are used for device testing, and for hardening military, government, and corporate facilities against both electromagnetic-based attacks as well as information leakage. The Virtual Faraday Cage simulates this by placing untrusted extension code within a sandbox and inhibiting its communication with any entities outside of the sandbox.

The most significant difference between the traditional architecture for web application platforms and their third-party extensions, and the Virtual Faraday Cage's architecture, is that the latter applies information flow control to how information is transmitted between the platform and third-parties. In particular, by utilizing a sandboxing mechanism, it becomes possible to run third-party code that can be guaranteed to run 'safely': the untrusted code is limited in computational capabilities and can only access certain method calls and programming language capabilities.

To help clarify the specific role of the Virtual Faraday Cage, it helps to separate the architectural components and actors into two different areas: the external and internal realms. The external realm (relative to a given web application platform and a single end-user) consists of all third-parties and their extensions. The internal realm consists of infrastructure associated with running the web application platform, along with the end-user's own system (see Figure 4.1).

Figure 4.1: Internal and external realms of a platform.

The Virtual Faraday Cage focuses primarily on countering potential privacy violations from the external realm, and consequently makes the assumption that all principals in the internal realm are 'well-behaved' by conforming to the model. In other words, the web application platform can enforce the privacy-policies and write-policies of all principals acting within the internal realm. On the other hand, third-parties are free to collude with each other and to lie to end-users and the web application platform. Despite this, however, the Virtual Faraday Cage can still provide some security and privacy guarantees.

The Virtual Faraday Cage splits traditionally remotely hosted extensions into two components: one that is located remotely and outside of the platform's control, and one that is hosted locally by the platform (see Figure 4.2). While all extensions to the platform require at least one of these components, they are not required to have both. This means that existing extensions do not have to change their architecture significantly to work with a Virtual Faraday Cage based platform. Furthermore, developers without their own server infrastructure could write extensions that only run locally on the web application platform itself.
Figure 4.2: Comparison of traditional extension and Virtual Faraday Cage extension architectures. Lines and arrows indicate information flow; dotted lines are implicit.

4.2 Features

Apart from its security and privacy properties, the Virtual Faraday Cage, and its implementation, offers several advantageous and distinct features: hashed IDs, opaque IDs, callbacks, seamless remote procedure calls, and interface reconstruction.

4.2.1 Data URIs

While not a novel feature, the Virtual Faraday Cage utilizes Uniform Resource Identifiers (URIs) to capture data hierarchy and organization. For example, data://johndoe/profile/age could be a URI representing an "age" data value for a particular end-user, "John Doe". Facebook utilizes a similar URI structure for data; for example, https://graph.facebook.com/<user-id> would result in returned data-items as keyed entries in a JSON dictionary [115], such as "first_name" and "last_name". While the Virtual Faraday Cage does not specify a global URI scheme or ontology for data, it is conceivable that a universal scheme could be constructed. On the other hand, allowing individual web application platforms (or even categories of platforms) to decide on their own URI scheme allows these platforms to easily adapt the Virtual Faraday Cage to fit their existing systems. Data URIs are covered more thoroughly in Section 4.4.

4.2.2 Hashed IDs and Opaque IDs

All IDs used within the implementation are hashed IDs: they are represented in a way that does not give out any information about either the number of records in the database or the sequential order of a particular record. Specifically, hashed IDs (h ∈ Z_{2^256}) are outputs from SHA-256, though any suitable cryptographic hash function can be used (as discussed in Chapter 5). In more generic terms, a user-id no longer conveys information about when that user registered relative to another user, nor does it convey information about how many users there might be within an entire web application. Similarly, when applied to things such as posts or comments, ascertaining how active or inactive a user may be is no longer possible through a "side channel" like ID numbers.

Opaque IDs are extension-specific IDs for other objects within a web application platform. Within this implementation, opaque IDs exist only for end-users; however, a more complete system would apply the same technique to all other objects. In the proof-of-concept, this is a computation- and storage-intensive task, and is omitted. Opaque IDs help inhibit one extension from matching user IDs with another extension, because a given user has a different opaque ID for each of two different extensions. Consequently, opaque IDs may be considered an aid to privacy-preservation.

4.2.3 Callbacks

For all API function calls, callback information can be passed as an optional parameter so that the Virtual Faraday Cage can send a [potentially] delayed response to the third-party extension. This provides two benefits: 1) third-party extensions do not have to wait for processing to occur on the back-end, or wait for end-user input, and 2) callbacks can be delayed or dropped, making it unclear to the third-party when the end-user was online and denied a request. In the latter case, this helps prevent leakage of information regarding when a user was online or not.
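A minimal sketch of this callback behaviour is given below, assuming hypothetical read_request and send_callback functions and a deliberately randomized delay; it is meant only to show how a platform could answer immediately when it can, and otherwise defer (or silently drop) the response.

```python
import random
import threading
from typing import Optional

def send_callback(callback_id: str, value) -> None:
    """Stand-in for delivering a delayed answer back to the requester."""
    print(f"callback {callback_id}: {value!r}")

def read_request(uri: str, callback_id: Optional[str] = None):
    needs_consent = True   # e.g. the only matching policy is request-on-demand
    if not needs_consent:
        return "value of " + uri          # immediate answer
    if callback_id is None:
        return None                       # no way to deliver a delayed answer

    def consult_end_user():
        if random.random() < 0.5:                         # end-user approves...
            send_callback(callback_id, "value of " + uri)
        # ...otherwise the callback is silently dropped, hiding the denial
        # (and hiding whether the end-user was even online).

    threading.Timer(random.uniform(0.1, 2.0), consult_end_user).start()
    return None   # the answer, if any, will arrive through the callback

read_request("data://johndoe/profile/location", callback_id="cb-42")
```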
Callbacks also allow for innovative application of privacy policies and data-views: for a given set of privacy-policies, a less-specific view of data could be provided automatically, and then later updated if the end-user chooses to allow the extension to see a more specific view of that data.

As an example, suppose that an end-user authorizes a third-party extension to always be able to see the current city that they are located in, but requires that third-party extension to obtain explicit on-request consent for more specific data. The extension could then always keep track of the user's general area, but whenever the user wants to let the extension know exactly where they are, the extension is granted that through a callback with the specific value for that end-user's location. In the context of a map-like extension, or geo-locational social networking, the third-party always knows the general area that the user is in, but can only learn exactly where the user is when the user chooses to reveal it by actively using the extension. Because the third-party would always be able to get broad geo-locational information, a hypothetical map-view could easily be kept open – and would be able to update itself when (and if) the end-user allows it.

4.2.4 Seamless Remote Procedure Calls and Interface Reconstruction

As an architecture, the Virtual Faraday Cage can operate with any compatible remote-procedure-call protocol. However, in the development of the Virtual Faraday Cage, the Lightweight XML Remote-procedure-call Protocol (LXRP) was developed and implemented in Python. Consequently, the following features are specific to implementations of the Virtual Faraday Cage that are built on LXRP.

Remote procedure calls are as seamless as possible within the Virtual Faraday Cage because LXRP allows for any type of function or method to be exposed to remote clients. Furthermore, custom objects can be serialized and passed between client and server, allowing for rich functionality beyond passing native data-types. Finally, exceptions raised on the remote end can be passed directly back to the client, allowing developers to write code that utilizes RPC functionality as though it were local. Future work could allow the direct exposure of objects to clients rather than manually specifying methods, but this would largely be a stylistic improvement rather than a functional one.

Interface reconstruction allows any web application platform, or an agent acting on its behalf, to access remote functions locally immediately upon connection to the extension, without needing to consult external resources beforehand to learn what particular functionality is available. Specifically, interface reconstruction brings the remote objects and functionality into the local scope of either the developer or the application using the interface. This differs from existing [Python] RPC libraries in that the exact function interface is reconstructed locally. This allows syntax errors to be caught immediately, and documentation to be accessed locally. Future work could also allow for local type-verification, so many exceptions and errors can be caught locally without needing to query the remote server.
Aside from these benefits, interface reconstruction also aids developers and testers, as they can easily see what functions are available to them and directly call functions through more natural syntax such as "api.myfunc()" rather than something like "call("myfunc", [])" – as these functions have been reconstructed in the local scope.

4.3 Information Flow Control

In Myers and Liskov's Decentralized Information Flow Control model [83], the Virtual Faraday Cage can best be represented by Figure 4.3. In their model, information flow policies are represented by {O1: R1; ...; On: Rn}, where Ri is the set of principals that can read the data as set by the owner Oi. The owners of information can independently specify the information flow policies for their information, and the effective reader set is the intersection of each Ri. This means that a reader r can only access a particular piece of information if all owners authorize it. Trusted agents (represented by a double border) can act on behalf of other principals within the system and can declassify policies.

While the Virtual Faraday Cage makes use of some similar concepts in information flow control, the level of sophistication in the Virtual Faraday Cage is far lower than that of a complete and broadly applicable model such as the one presented by Myers and Liskov. The Virtual Faraday Cage assumes that the only untrusted components in a web application platform are third-party extensions, and information flow is controlled in the same general manner for each extension. Extension components running locally that examine private data can never send any information back to the third-party. As the Virtual Faraday Cage is relatively simple in its use of information flow control, the Decentralized Information Flow Control model is not used to describe the Virtual Faraday Cage any further in this chapter.

Figure 4.3: The Virtual Faraday Cage modeled using Decentralized Information Flow Control.

In the Virtual Faraday Cage, local extension components could have unrestricted read access to all sensitive data, but would be unable to communicate that knowledge back to the third-party. Local extension components are allowed to write only to their cache space and to the end-user's private data, both of which are unreadable by the remote extension component; this is enforced by running the local component inside a sandbox. Write operations by principals other than the owner of the data require explicit owner approval.

Proposition 4.1. The Virtual Faraday Cage permits only inbound information flow to local extension components.
Proof: As Proposition 3.1 states, third-parties can never obtain information from local extension components because there is no capability for them to write to anything other than private data or local extension cache spaces. On the other hand, remote extension components may have write or alter capabilities on a particular data-item s ∈ Su, which in turn could be read by a local extension component. Additionally, each extension has a local extension cache space which is writable (but not readable) by the remote extension component. Consequently, information flow to the local extension component is exclusively inbound.

Another diagram outlining the information flow in the Virtual Faraday Cage is provided in Figure 4.4.

Figure 4.4: Information flow within the Virtual Faraday Cage. Dotted lines indicate the possibility of flow.
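As a small illustration of these default information-flow rules (the table and function names below are assumptions, not the proof-of-concept's code), the platform could keep a per-extension rule table like the following and consult it before every cache or data access:

```python
# Default access rules for one extension's cache space C_{E,u} and for the
# end-user's private data X_u, in the spirit of Definition 3.11 and Proposition 4.1.
# "remote" is the third-party-hosted component E, "local" is the sandboxed E'.
DEFAULT_RULES = {
    ("remote", "cache"):   {"read": False, "write": True},   # inbound flow only
    ("local",  "cache"):   {"read": True,  "write": True},
    ("remote", "private"): {"read": False, "write": False},  # X_u is off-limits to E
    ("local",  "private"): {"read": True,  "write": True},   # still subject to the
                                                             # end-user's policies
}

def allowed(component: str, target: str, operation: str) -> bool:
    """The platform may still deny an otherwise-permitted request (e.g. quota)."""
    return DEFAULT_RULES.get((component, target), {}).get(operation, False)

assert allowed("remote", "cache", "write")
assert not allowed("remote", "cache", "read")   # no outbound channel via the cache
```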
4.4 URIs

Uniform Resource Identifiers (URIs) provide a richer context for data indices, and allow hierarchies to be plainly visible when referencing data. While a data-item referenced at ID 92131 might be the same as the data referenced by a URI at data://johndoe/location, the latter method clearly shows that the data-item belongs to an end-user with an ID of johndoe. The URI referencing scheme within the site-specific implementation allows one data-item to be referenced at multiple URIs – for instance, allowing data://johndoe/friends/janedoe to be the same as data://janedoe. Similarly, if a data-item is owned by multiple owners, this allows that data-item to be referenced at two different URIs that both demonstrate an ownership relationship from within the URI.

4.4.1 Domains

Domains in a URI represent principals within the Virtual Faraday Cage, for example end-users or extensions. Domains are the hexadecimal representation of the hashed and opaque IDs of each principal. Consequently, instead of accessing data://johndoe/ you would be accessing something like data://33036efd85d83e9b59496088a0745dca7a6cd69774c7df62af503063fa20c89a/.

4.4.2 Paths

URI paths should reflect the data hierarchy for a given web application platform. Ideally, the less variation there is in paths across different web application platforms, the easier it becomes to develop cross-platform extensions. For example, while paths such as /name may suffice for many platforms, moving an end-user's 'name' to /profile/name may allow for less risk of conflict between different web application platforms – both specific implementations, as well as different categories. While this thesis does not present a strict hierarchy or ontology as a proposal for all implementations of the Virtual Faraday Cage, the proof-of-concept implementation used the following paths to represent end-user data:

Data Paths on a Social Network
• /profile/first-name - First name of the end-user
• /profile/last-name - Last name of the end-user
• /profile/gender - Gender
• /profile/age - Returns the end-user's age
• /profile/location - Returns the end-user's location
• /posts - Returns a list of post IDs
• /posts/<id> - Returns the content of the post
• /friends - Returns a list of friend IDs
• /friends/<id> - Returns the friend's name (as seen by the end-user)

The proof-of-concept implementation was also able to perform URI translation – allowing it to switch between a fake social network for testing, and Facebook. In order to interface with the latter, URIs are "translated" from the Facebook URI scheme into a scheme matching the other. If this technique is applied to other sites, it would be possible for the same URI to be used for Facebook, MySpace, Google+, or any other social networking platform without requiring major changes for an extension running on top of the VFC API. For example, an extension could request to read the URI profile://736046350/name, and this URI would then be translated into https://graph.facebook.com/736046350/?fields=name and the appropriate data type returned. Keeping the structure of URIs the same across all web application platforms permits easier development and deployment of extensions across multiple platforms. Similarly, this allowed the same extension to run on both an example social network and on Facebook.
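A URI-translation layer of the kind described here can be little more than a table of path-to-field rewrites plus a hostname swap. The sketch below mirrors the profile://736046350/name example; the table contents and helper name are illustrative assumptions rather than the proof-of-concept's actual wrapper.

```python
# Sketch of URI translation from a generic VFC data scheme to the Facebook
# Graph API. The table entries and helper are assumptions for illustration.
PATH_TO_GRAPH_FIELD = {
    "/name": "name",
    "/profile/first-name": "first_name",
    "/profile/last-name": "last_name",
    "/profile/gender": "gender",
    "/profile/location": "location",
}

def to_graph_url(uri: str) -> str:
    """Translate a URI such as profile://<user-id>/<path> to a Graph API URL."""
    rest = uri.split("://", 1)[1]                 # "<user-id>/<path>"
    user_id, _, path = rest.partition("/")
    field = PATH_TO_GRAPH_FIELD["/" + path]
    return f"https://graph.facebook.com/{user_id}/?fields={field}"

print(to_graph_url("profile://736046350/name"))
# -> https://graph.facebook.com/736046350/?fields=name
```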
In the proof-of-concept, a Facebook wrapper was constructed which allowed access to the Facebook Graph API [116], and the following attributes were made available from Facebook data (as per the User object properties [115]):

Facebook URI and Attribute Translation
• id → null - Facebook user ID¹
• first_name → /profile/first-name - First name of the end-user
• last_name → /profile/last-name - Last name of the end-user
• middle_name → /profile/middle-name - End-user's middle name
• gender → /profile/gender - Gender
• locale → /profile/locale - Returns the end-user's locale (e.g., "en_US")
• birthday → /profile/birthday - Returns the end-user's birthday (note that /profile/age can be generated from this)
• location → /profile/location - Returns the end-user's location
• /friends → /friends - Returns a list of the end-user's friends
• /friends/<id> → /friends/<id> - Returns the friend's name
• /statuses → /posts - Returns a list of status IDs
• /statuses/<id> → /posts/<id> - Returns the status content

¹ In the Virtual Faraday Cage, "true" IDs are not revealed, and consequently this was omitted from URI translation.

4.5 Application Programming Interfaces

The Application Programming Interfaces (APIs) of the Virtual Faraday Cage represent the interfaces through which third-party extensions can interact with the underlying web application platform and end-user data, and vice versa. The API embodies an access-controlled, asynchronous remote-procedure-call interface between third-parties and the underlying web application platform – and end-user data. Specifically, the API exists to support Propositions 3.1 and 4.1.

4.5.1 Web Application Platform API

The Platform API consists of six methods. In practice, it is possible to build a higher-order API on top of this 'base API'; however, it should suffice to use only these six methods.

Read-Request
A read-request is a function r : URI × Z_15 × ({0, 1}^256 ∪ {NULL}) → Data ∪ {NULL} that can return end-user data if this request is allowed by an explicit end-user policy. Read-requests are passed the URI of the data to be read, a "priority code", and an optional callback ID. The priority code represents an integer that maps to a specific ordering of privacy-policy types, allowing a read-request to specify priorities for data views. For instance, a read-request could prioritize request-on-demand over always, meaning that the data view for always would be returned if the data view for request-on-demand fails. In total, there are 15 possible priority codes, corresponding to all possible orderings of length ≤ 3 of the set {request-on-demand, always, single-access}. Figure 4.5 illustrates the detailed steps of how a read request is handled within the Virtual Faraday Cage.

Figure 4.5: Process and steps for answering a read request from a principal. Start and end positions have bold borders; 'no' paths in decisions are dotted.

If the request is authorized immediately, the data will be returned immediately; otherwise the only way to receive the data is through a callback. Alternatively, if one privacy-policy type can return data immediately, that view is returned to the requester, and the other view may be returned later if it becomes authorized (e.g., request-on-demand). The reason for requiring a callback is so that third-party extensions are not blocked while waiting for an end-user to authorize a request. Furthermore, by having callbacks, it becomes possible to maintain some level of ambiguity regarding whether or not a particular end-user is online.
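To illustrate how the 15 priority codes could be realized, the sketch below enumerates every ordering of length 1 to 3 over the three policy types (3 + 6 + 6 = 15) and resolves a read request against the first type in the chosen ordering that can produce a view right away. The enumeration order and the resolve_read helper are assumptions, not the thesis's actual encoding.

```python
from itertools import permutations

POLICY_TYPES = ("request-on-demand", "always", "single-access")

# All orderings of length 1..3 over the three policy types: 3 + 6 + 6 = 15.
PRIORITY_CODES = {
    code: order
    for code, order in enumerate(
        (p for n in (1, 2, 3) for p in permutations(POLICY_TYPES, n)), start=1
    )
}

def resolve_read(available_views: dict, priority_code: int):
    """Return the first view obtainable right now under the chosen ordering.

    `available_views` maps a policy type to an immediately available view of
    the data, or to None if that type cannot answer yet (e.g. awaiting consent).
    """
    for policy_type in PRIORITY_CODES[priority_code]:
        view = available_views.get(policy_type)
        if view is not None:
            return view
    return None   # nothing available now; any later answer arrives via callback

assert len(PRIORITY_CODES) == 15
# Code 4 (in this enumeration) prefers request-on-demand but falls back to always,
# matching the prioritization example in the text.
print(resolve_read({"request-on-demand": None, "always": ("Calgary",)}, 4))
```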
While callback IDs are not required, not using them would prohibit updates from end-users with potentially more detailed or more up-to-date information.

Write-Request
A write-request is a function w : URI × Data × ({0, 1}^256 ∪ {NULL}) → {0, 1, NULL} that returns True, False, or None when called and passed data to write at a given URI. An optional callback ID is also available, which allows for delayed responses from end-users (e.g., if the end-user set write-access to request-on-write). An illustration showing the process of a principal requesting write access is shown in Figure 4.6.

Figure 4.6: Process and steps for answering a write request from a principal. Start and end positions have bold borders; 'no' paths in decisions are dotted.

Create-Request
A create-request is a function c : URI × Data × ({0, 1}^256 ∪ {NULL}) → {0, NULL} ∪ URI that returns a new URI location, False, or None. It functions in a similar way to a write-request, except that it will return a new URI corresponding to the location of the newly-created data if the call is successful. An illustration showing the process of a principal requesting create access is shown in Figure 4.7.

Figure 4.7: Process and steps for answering a create request from a principal. Start and end positions have bold borders; 'no' paths in decisions are dotted.

Delete-Request
A delete-request is a function d : URI × ({0, 1}^256 ∪ {NULL}) → {1, NULL} that returns True or None when called. A delete-request is passed a URI representing data to be deleted, and if the request is authorized, True is returned. An illustration showing the process of a principal requesting delete access is shown in Figure 4.8.

Figure 4.8: Process and steps for answering a delete request from a principal. Start and end positions have bold borders; 'no' paths in decisions are dotted.

Subscribe-Request
A subscribe-request is a function s : URI × ({0, 1}^256 ∪ {NULL}) → {1, NULL} that returns True or None when called. Subscribing to a URI allows the subscribing entity to be notified when changes are made to the data at the given URI. Later, when data is changed, the platform will check the complete list of subscribers for that data URI and all its parents, and then notify the subscribed principals only if their view of the data has been altered. An illustration showing the process of a principal requesting subscription access is shown in Figure 4.9. Another illustration showing the process of notifying subscribers when data has been altered is shown in Figure 4.10.

Figure 4.9: Process and steps for answering a subscribe request from a principal. Start and end positions have bold borders; 'no' paths in decisions are dotted.

Unsubscribe
An unsubscribe is a function u : URI × ({0, 1}^256 ∪ {NULL}) → {1, NULL} that returns True or None when called. Unsubscribing from a URI removes future notifications for updates or changes to the data at the given URI. An illustration showing the process of a principal performing an unsubscribe is shown in Figure 4.11.

Figure 4.10: Process and steps for notifying subscribed principals when data has been altered. Start and end positions have bold borders; 'no' paths in decisions are dotted.

4.5.2 Third-Party Extension API

As each third-party extension can be unique and different, there is very little that can be mandated about what functions third-party extension APIs should have.
However, the following function is considered to be a minimum requirement for all third-party extensions:

Get-Interface
getInterface is a function i : (Any ∪ {NULL}) × ({0, 1}^256 ∪ {NULL}) × ... → Any that returns a "user interface" when called, and has a parameter for an optional callback ID. Additional parameters can also be passed along, if necessary or available. Depending on the type of web application platform, the interface returned may be an XML document, a web page, or some other data. If no value is passed to the method, a "default" interface should be returned.

Figure 4.11: Process and steps for answering an unsubscribe from a principal. Start and end positions have bold borders; 'no' paths in decisions are dotted.

4.5.3 Shared Methods

In addition to the above APIs, there exist two shared methods that are required by both the web application platform and all valid third-party extensions:

Send-Callback
sendCallback is a function b : {0, 1}^256 × ({NULL} ∪ Data) → NULL that one party can use on another party's API to return the value of a callback if and when it is completed. This method is implemented in the underlying remote procedure call library.

Verify-Nonce
verifyNonce is a function v : {0, 1}^256 → {0, 1} that returns true or false when called, depending on whether the supplied nonce is correct. This is used in the High-Level Protocol for mutual authentication. See Section 4.6 for details.

4.5.4 Relationship with the Theoretical Model

Read-Requests and Subscribe-Requests, and the data that they return, abide by the privacy-policies in Definition 3.9. Similarly, Write-Requests, Create-Requests, and Delete-Requests all abide by the write-policies in Definition 3.10. As a consequence, all operations utilizing this API must abide by the underlying privacy and write policies as defined in the theoretical model in Chapter 3. This means that if principals within a system are forced to use only this API to interact with end-user data, then Proposition 3.1 and Proposition 4.1 hold.

4.6 High-Level Protocol

The Virtual Faraday Cage's "High-Level Protocol" (VFC-HLP) specifies how third-party extensions are interacted with, and how third-party extensions interact with the web application platform. The VFC-HLP can work with any underlying remote procedure call mechanism (not just LXRP), as it abstracts the process and only requires an RPC library that can provide both security and callbacks. This section presents the VFC-HLP specifications.

4.6.1 Accessing a Third-Party Extension

Accessing an extension is performed by specifying a URL from which a third-party extension is to be installed, whereupon any remote and/or local components are loaded into the web application platform and made accessible to the end-user. By default, extensions must be installed over HTTPS connections where certificate verification can be performed; this ensures that mutual authentication between extensions and the web application platform can be reliably performed.

When a user wishes to install a simple web application extension, the process would not be much different from that of current architectures. As the extension does not need to perform sensitive operations, there is no need for that extension to have any local component whatsoever. When a user wishes to install a more complicated application extension, the process may become more involved. In particular, that extension's local component must be downloaded and run within a sandbox by the web application platform.
Upon authorization, the end-user must also specify what sensitive data, if any, the remote extension component has access to, and similarly, what private data, if any, the local extension component can access. Figure 4.12 shows an overview of the procedure for authorizing and accessing a third-party extension for a particular end-user. Figure 4.12: Steps required for authorizing and accessing a third-party extension. 96 Specifically, when accessing and authorizing a third-party extension, a VFC-compatible web application platform acting on the behalf of an end-user would be able to query the URL for the extension while sending a X-VFC-Client header identifying itself as interested in obtaining the specifications for the third-party extension. The third-party extension would then send back a 200 OK response, and send a LZMA-compressed JSON-encoded keyed-array that represented the extension specifications. These specifications would include the URLs for both RPC access to the thirdparty extension’s remote component API, as well as the URL for downloading the local extension component. Additionally, the types of end-user data requested on-install of the extension are also specified. See Figure 4.13 for an example. Figure 4.13: The EMDB extension specifications 97 Third-Party Extension Specifications The extension specifications returned by a third-party should consist of the following entries: • name - This is the full name of the third-party extension. • canonical-url - This is the URL for accessing the third-party extension specifications. • display-version - This is the version of the third-party extension, as displayed to end-users. • privacy-policy - This specifies the URL to the privacy-policy for the third-party extension. • owner - This is a keyed sub-array that describes the third-party, consisting of three parameters: name, short-name (optional), and url. • local-component - This represents a keyed-sub-array consisting of a url parameter (the URL to the local-component), as well as as a type parameter, which specifies the local component’s format (e.g., ’plain’, ’lzma’, ’gzip’, etc.) • remote-component - This is a keyed sub-array consisting of a type parameter and a url parameter. The type specifies the type of remote-component (e.g., ‘lxrp’), and the URL specifies its location. • request - This is a sequence of keyed-arrays, where each keyed-array consists of a data URI path (path), as well as a data-specific ‘reason’ (reason) for requesting that specific data. • purpose - This is a plain-text explanation of what the requested data will be used for, and why it is being requested. 98 • notes - (Optional) This entry represents plain-text notes that the third-party extension can choose to pass along. 4.6.2 Mutual Authentication In order for the Virtual Faraday Cage to operate in a secure manner, mutual authentication and access control are required between the Virtual Faraday Cage and third party extensions. When a third-party extension receives a connection request, it must verify that the requesting party really represents an authorized web application platform. Similarly, when a web application platform receives a request to perform some actions, it must verify which third-party extension is actually performing the requests. To facilitate this, when a connection is first received by any party (the ‘receiver’), that party will then make a connection back to whomever the connecting party (the ‘requester’) claims to be to verify the connection. 
Because outgoing connections can be made over SSL, the receiver can be assured that they are connecting to the true originator of the incoming connection. Once connected to the true originator, a cryptographic “nonce” (number that is used only once) supplied by the requester will be verified by the true originator. This will allow the receiver to be assured that the incoming connection from the requester is authorized. See Figure 4.14 and 4.15 for diagrams showing the mutual authentication process. If LXRP is the remote-procedure-call protocol of choice, it may be possible to efficiently disregard invalid or unauthorized requests from malicious or misbehaving clients depending on how the Reference Monitor is implemented. As the authentication method for VFC-Compliant LXRP servers requires mutual authentication through URLs, invalid authentication requests can lead to connection attempts to web servers that consume bandwidth and computational time. It may be possible to mitigate this attack by incorporating secret keys for authorized third-party extensions, or moving to a user-id/key 99 Figure 4.14: Authenticating an incoming connection from a VFC platform Figure 4.15: Authenticating an incoming connection from a third-party extension model instead. For other RPC protocols, there may be other methods available as well. 4.6.3 Privacy by Proxy “Privacy by Proxy” refers to the ability for the web application platform to act as a proxy for third-party extensions, allowing for personalization of extensions without needing to reveal end-user data to third-parties [65]. Within the Virtual Faraday Cage, extensions that display prompts or require data input from end-users can display data that they otherwise would not have access to by utilizing a special XML element <value>. For example, <value from="data://91281/name"/> would be substituted with user 91281’s 100 name, such as “John Doe”. This would be handled by altering the interfaces returned from Get-Interface calls. 4.7 Remote Procedure Calls 4.7.1 Overview The Virtual Faraday Cage uses the Lightweight XML RPC Protocol (LXRP) to handle all RPC calls between third-parties and the web application platform, or vice versa. LXRP is a relatively simple protocol that allows for RPC clients to easily interface with a given RPC server, and automatically discover the available methods and API documentation on connect. Similarly, LXRP allows for developers to easily expose methods and functionality to clients while restricting access by using a “Reference Monitor” object. Reference monitors were described by Anderson [117] as a supervisory system that “mediates each reference made by each program in execution by checking the proposed access against a list of accesses authorized for that user.” Anderson states that in order for a Reference Monitor to work, the reference validation mechanism must be tamper proof, must always be involved, and must be small enough to be tested. LXRP operates over the HTTP protocol, and sends messages in XML format via the HTTP POST method. Clients will also, by default, send an additional header, X-LXRP-Client, allowing servers to identify it as a legitimate LXRP client. This allows for LXRP servers to use the same URI for a web browser versus a LXRP client. Within the context of the Virtual Faraday Cage, this allows for third-party extensions to use the same URL for an informational page as the URL for adding the extension to a web application platform. 
For example, a request to http://www.imdb.com/app/ could be handled differently depending on whether the client was a web browser or an LXRP client. A web browser could be given information on how to add the extension to a web application platform such as Facebook, while an LXRP client would interface directly with the LXRP server.

4.7.2 Protocol Requirements

In developing the RPC protocol for use in the Virtual Faraday Cage, a few key attributes were identified:

• Encryption – All calls to the API had to be capable of being performed over SSL between servers with valid certificates.
• Authorization – All API calls had to be authorized through a cryptographic token.
• Authentication – There had to be a way to authenticate third-parties to ensure that they were who they claimed to be, and vice versa through mutual authentication.
• Resilience – The protocol had to be resilient against malicious users and, ideally, denial-of-service attacks.
• Extensibility – The protocol had to be extensible enough to add new API functions as they became necessary.

4.7.3 Requirement Fulfillment

The Lightweight XML-based RPC Protocol (LXRP) was designed to incorporate these attributes. LXRP provides two main objects to developers: the API Interface and the Reference Monitor. To create an API instance, a developer must pass a list of allowed functions which can be called remotely, along with a Reference Monitor instance which acts as the decision maker behind access to the API. LXRP operates in a secure mode by default, forcing HTTPS connections and verifying SSL certificates according to a certificate chain passed to it on initialization. Upon initializing a connection to an LXRP API interface, a client passes a keyed array of connection credentials to the LXRP server, which then queries the Reference Monitor before issuing a cryptographic token to the client. These connection credentials can be anything from a username and password to other data such as biometric signatures. Authorization for function calls is verified by the Reference Monitor for all RPC calls; this allows for partial exposure of functionality to "lower clearance" clients, as well as other access control possibilities. Authorization is based on a cryptographic token t chosen uniformly at random such that t ∈ {0, 1}^256. Consequently, the difficulty of forging a function call request can be made comparable to, or better than, what is expected from current security practices on the web. Extensibility within LXRP is inherent, as there are no restrictions on exposing new methods through it. LXRP exposes a set of functions to clients, which access them through a Resource object. These functions can be public or "private" methods, allowing for the easy extension of "support-specific" methods such as cryptographic nonce verification and callback support. Thus, the primary functionality of an API exposed through LXRP can be given through public methods, and any additional LXRP- or VFC-specific functionality can be added through private methods. LXRP servers also support asynchronous remote procedure calls between LXRP servers, with callbacks referencing a method call sent back to those servers. If the client of the first server requests a callback and passes along information about the second server and any authentication tokens needed to perform the callback, then the first server will dispatch a client to perform the callback if and when that function evaluation becomes available.
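A rough sketch of this server-side callback dispatch is given below. The CallbackRequest class, the rpc_client_factory helper, and the generic client.call method are assumptions for illustration; they are not the LXRP implementation itself.

    class CallbackRequest(object):
        # Captures what the first server needs in order to call back the
        # second server once the deferred result becomes available.
        def __init__(self, callback_uri, callback_id, callback_token):
            self.callback_uri = callback_uri
            self.callback_id = callback_id
            self.callback_token = callback_token

    def dispatch_callback(request, result, rpc_client_factory):
        # The first server acts as an LXRP *client* toward the second server,
        # invoking its shared sendCallback method with the original callback
        # ID and the now-available result.
        client = rpc_client_factory(request.callback_uri,
                                    token=request.callback_token)
        client.call("sendCallback", request.callback_id, result)

    # Hypothetical usage, once a deferred evaluation finishes:
    # dispatch_callback(pending_request, computed_value, make_lxrp_client)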
Because a callback may not be guaranteed, LXRP clients should specify a maximum time-to-live (TTL) for a given method request, and consider all callbacks that take longer than that duration to be lost or, specifically in the Virtual Faraday Cage's situation, denied.

4.7.4 Protocol

When an LXRP client attempts to connect to a server, it first sends an auth request along with any supplied credentials, and the server responds with an authentication token for the client to use for all future requests. Alternatively, pre-established tokens can be used, if available. Once an authentication token has been obtained, an LXRP client can include it with all subsequent requests. Afterward, a query request can be sent to obtain the list of available functions that can be called by clients. Consequently, if a client already has its authentication token and knows which functions it can call, both the auth request and the query request are optional. See Figure 4.16 for details.

Figure 4.16: The Lightweight XML RPC Protocol

4.7.5 Messages

There are three message types: auths, queries, and calls. "Auths" acquire authentication tokens for LXRP clients, "queries" simply get the list of available functions that a given authentication token can call, and "calls" are remote procedure calls that return the result of function evaluation.

Auth
An auth message takes the form "<auth>[...]</auth>", where the contents are the serialized Python data values of the credentials supplied, if any.

Query
A query message takes the form "<query from="[...]"/>", where the attribute from is optional and represents the authentication token.

Call
A call message takes the form "<call func="[...]" from="[...]" callback-uri="[...]" callback-type="[...]" callback-id="[...]" callback-token="[...]">[...]</call>", where the contents of the call message can include zero or more parameters of the form "<param name="[...]">[...]</param>". The contents of parameters are the serialized Python data values. Like the query message, calls do not require the from attribute. Similarly, the callback-* attributes are not required either – they are only supplied if the requesting party wants to receive the response as a callback. If a function is called with parameters, each parameter is included in the body of the call.

4.7.6 Serialized Data

Data is serialized recursively, allowing more complicated data-structures to be transmitted between the client and server within LXRP. All serialized data takes the form "<value type="[...]" encoding="[...]">[...]</value>", where type represents the value type (e.g., "string", "integer", etc.) and encoding represents the encoding – currently only "plain" (the default) and "base64". The contents of the value tag are the string value of the data. If the data is a more complicated data-structure, then the value tag may contain other value tags, adding additional structural information. An optional sub-element, key, can also be present within a value element – allowing the value to be assigned a "name", for example, in the context of a Python dictionary. Custom classes and objects, along with exceptions, can be serialized as well – however, custom objects will need to implement their own deconstruction and reconstruction mechanisms to be passed successfully through LXRP.

4.7.7 Responses

There are three response types in LXRP: auth-responses, query-responses, and call-responses.
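Before each response type is described, the request side specified above can be made concrete. The call message below is hypothetical (the function name, token values, callback URL, and parameter are invented), but it follows the element layout of Section 4.7.5 and could be assembled with the Python standard library as follows.

    import xml.etree.ElementTree as ET

    # A hypothetical call to a read_request method, authorized by a previously
    # issued token and asking for the answer to be delivered as a callback.
    call = ET.Element("call", {
        "func": "read_request",
        "from": "a3f1...e9",                        # 256-bit auth token (abbreviated)
        "callback-uri": "https://emdb.example/lxrp",
        "callback-type": "lxrp",
        "callback-id": "7c02...41",
        "callback-token": "90bd...aa",
    })
    param = ET.SubElement(call, "param", {"name": "uri"})
    value = ET.SubElement(param, "value", {"type": "string", "encoding": "plain"})
    value.text = "data://91281/name"

    print(ET.tostring(call))

The resulting document would then be sent in the body of an HTTP POST, as described in Section 4.7.1.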
Auth-responses consist of a single auth element whose content is the serialized authentication token that the client should use. Query-responses contain a list of available methods along with their descriptions (in both human- and computer-readable formats), and also pass along any server-side flags and attributes for LXRP clients to interpret. Call-responses simply return the data-values from a call request, though they can also return errors and exceptions to any request.

4.7.8 Security

LXRP relies on operating over the HTTPS protocol, allowing LXRP servers to leverage SSL and the existing public-key infrastructure to provide encryption, confidentiality, and authentication of server end-points. Servers then implement their own Reference Monitor, which can issue authentication tokens to clients based on the authentication credentials they present, as well as a client's IP address. The Reference Monitor provides access control over the methods exposed over LXRP to clients, as all call-requests are first sent to the Reference Monitor for confirmation before they are executed. Authentication tokens are 256-bit strings, ideally supplied by a true-random source or a strong pseudo-random number generator. In the current implementation, authentication tokens are derived from SHA-256 hashes of /dev/urandom for Unix-based systems, or the equivalent for Windows-based systems.

4.8 Sandboxing

Sandboxing can be accomplished in a number of ways for different languages, and the Virtual Faraday Cage does not specify which particular mechanisms should be used. However, for the Virtual Faraday Cage to function properly, a robust sandboxing mechanism must be available and capable of running local extension components within it. This sandboxing mechanism must be able to guarantee that any code running within it is incapable of interacting or communicating with any system or software components outside of the sandbox. Additionally, the sandbox must be capable of allowing a limited selection of functions, specifically the Virtual Faraday Cage API, to be exposed to the code running within it. By ensuring that sandboxed code can only interact with the Virtual Faraday Cage API, we can continue to ensure that Theorem 3.1 and Theorem 4.1 remain valid within the system. Otherwise, third-party code would have unrestricted access to the platform's systems, and ultimately, to end-user data. Sandboxing was implemented in the Virtual Faraday Cage's proof-of-concept implementation. While this implementation was developed in Python, the available authoritative references on sandboxing in Python were limited. As of Python 2.7, it appears that the statement "exec code in scope", where code is untrusted code and scope is a keyed array ("dictionary"), is sufficient for protecting against code using unwanted functionality from within Python, so long as built-in functions have been removed from the scope. Other possibilities for implementing sandboxing included using the heavier pysandbox [118] library, or code from the Seattle Project [119]. Pysandbox is a Python sandboxing library that allows for extensive customization and control over sandboxed code. The Seattle Project, on the other hand, is a distributed computing platform that uses sandboxing to enable untrusted code to run on machines donating their computational resources. The Seattle Project's sandboxing mechanism relies on static code analysis followed by a restricted scope for the exec command.
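A minimal sketch of the restricted-exec approach described above is shown below. It is Python 2.x only, and the exposed rate_movie function, the untrusted snippet, and the data URI are invented for illustration.

    # Python 2.x only: bare-bones restricted execution. Built-ins are stripped
    # from the scope, and only individual functions (never whole objects) are
    # exposed to the untrusted code.
    def rate_movie(uri, stars):
        # Stand-in for a single permitted VFC API method.
        return 1 <= stars <= 5

    untrusted_code = "ok = rate_movie('data://91281/ratings/42', 4)\n"

    scope = {
        '__builtins__': {},        # remove built-in functions from the scope
        'rate_movie': rate_movie,  # expose only the permitted API call
    }
    exec untrusted_code in scope
    print scope.get('ok')          # -> True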
As the proof-of-concept is only intended to demonstrate the feasibility of the Virtual Faraday Cage, the simpler and “built-in” approach of just using “exec code in scope” was undertaken instead. Using simply “exec code in scope” however exposes one critical flaw: passing any object to code ‘sandboxed’ in this manner, exposes the object’s parent class, and the code could conceivably find attributes or functions that could be called to leak information (or worse). However, if individual object methods are passed, the sandboxed code would be unable to retrieve or access the parent object. Additionally, any methods called from the sandboxed code cannot access anything from the parent scope – so the scope restrictions on sandboxed code apply at all times during execution. While “exec code in scope” works in Python versions 2.x, the sandbox safeguards have been removed in Python 3.x. This will mean that any attempts to performing sandboxing for Python 3 will require the use of tools such as pysandbox instead of built- 108 in functionality. Furthermore, “exec code in scope” does not prevent untrusted code from overusing CPU or memory resources, nor does it force execution within a set time interval. To provide a comprehensive sandbox mechanism accounting for those problems, “exec code in scope” would have to be run in a separate and monitored Python thread. For the purposes of demonstrating the Virtual Faraday Cage, it was sufficient to implement basic sandbox protection using only “exec code in scope” without using separate threads – but, this would not be a tenable solution for a production service. 4.9 Inter-extension Communication At this point, the Virtual Faraday Cage already presents a meaningful framework within which privacy-preserving extensions can be built. However, it lacks a certain capability that extensions running on popular platforms (such as social networking platforms) have, namely the ability to share and process data between different users of the same extension. To accomplish this, more than one approach must be considered (and these approaches may not necessarily be mutually exclusive) as carelessly adding this functionality may lead to information leakage and the loss of privacy by end-users. One approach to this problem is to prohibit the local extension components from having any write capability. This means that, given u1 , u2 as end-users, if a local extension component reads some private data s ∈ Xu2 , this local extension component cannot then write the value of s to the set of private data Xu1 for any other user u1 . This policy can be enforced by ensuring that wE 0 (s) = −1 ∀s ∈ Xu1 , Xu2 . However while this method may be useful in conjunction with other methods, by itself it would serve to greatly restrict the functionalities of local extension components, while simultaneously not addressing other potential privacy issues such as private data revelation to other end-users. While the thrust and focus of the Virtual Faraday Cage is to address privacy concerns with 109 third-parties specifically, this proposed approach leaves much to be desired in practice. Figure 4.17: Hypothetical prompt and example extension asking for permission to share end-user data. Another approach is for the extension to ask the end-user for permission and authorization to share or update shared data in the local cache. This could be done by presenting a prompt to the end-user and showing the full representation of the data to be shared. 
Figure 4.17 shows how this might be implemented in practice. As long as explicit and strong declassification from private data to sensitive data is prohibited, this could be considered a low-level of declassification. An extension’s local component could copy the shared data to another end-user’s private data, which would still be restricted 110 from third-party access. Additionally, as the end-user in question already authorized the viewing of that private data by the other end-user, there is no privacy violation in this context. 4.10 Methodology and Proof-of-Concept This section covers the methodology behind taking the theoretical model and turning it into the Virtual Faraday Cage’s architecture, and implementing the proof of concept. 4.10.1 Methodology Taking the Virtual Faraday Cage from its formal model to an implementable one consisted of several steps: 1) determining how data would be structured and accessed, 2) defining Application Programming Interfaces, 3) choosing an existing, or creating a new remote procedure call protocol to access these APIs, 4) determining how sandboxing untrusted third-party code could be performed, 5) developing a high-level protocol for VFC-compliant web application platforms and third-party extensions, and 6) implementing a basic proof-of-concept demonstrating the feasibility of realizing the Virtual Faraday Cage. Determining how data would be structured and accessed is fairly straightforward. Using URIs to reference data-items is both natural and intuitive, as well as easy to implement. In the formal model, data was abstracted as a vector comprised of atomic types within a global set of all end-user data but there was no innate way to access data-items or reference them. Binding data-items to URIs, on the other hand, provides a means to easily reference data-items, as well as a means through which a data hierarchy can be expressed. 111 Developing the API for the Virtual Faraday Cage required taking the Formal Model and building a set of operational primitives on top of it. The formal model provides both both read and write access, and the Virtual Faraday Cage extends these along with the hierarchical data structure (represented by data-item URIs) to make available an additional six specific data-manipulation methods: read, write, create, delete, subscribe, and unsubscribe. In determining how remote procedure calls would work in the Virtual Faraday Cage, three competing technologies were each examined as the potential protocol for communication between third-party clients and the Virtual Faraday Cage API server. These technologies were SOAP [120], REST [121], and ProtoRPC [122]. The possibility of a VFC-specific protocol was also considered. As Google AppEngine [123] was being used at the time to host the Facebook App and VFC-Wrapper, it was important to pick a technology that had libraries that could be used within AppEngine, because Google AppEngine does not support third-party C/C++ libraries [124]. Consequently, this limited the available choices for implementing the Virtual Faraday Cage. For SOAP, the libraries available included Ladon [125], Rpclib [126], and pysimplesoap [127]. Other libraries were not considered due to either their additional requirements, lack of maintenance, or other factors. While only pysimplesoap included a SOAP client, a library such as SUDS [128] was available as a Python SOAP client. 
For REST, the available server libraries were appengine-rest-server [129], web.py [130], Flask [131], and Bottle [132], and a REST client could then be implemented fairly easily [133]. Aside from appengine-rest-server, the remaining REST libraries were similar to each other, and consequently only appengine-rest-server and web.py were examined in detail. ProtoRPC, on the other hand, was Google's own web services framework, which passes messages in a JSON-encoded format. While SOAP was the most attractive option due to its widespread support (SOAP is a formal W3C specification [120], with libraries in C/C++ [134], Java [135], and Python [125, 126, 127]) and its lack of dependence on the HTTP protocol, there were no available SOAP libraries that worked on Google App Engine without needing modifications. REST, on the other hand, limited the available methods to GET, PUT, POST, and DELETE. While the Virtual Faraday Cage's data model could easily be seen within the context of a REST-like system, this would limit the capabilities of the API. Thus, utilizing REST was not considered an optimal solution. Finally, Google's ProtoRPC was unusable in Python 2.7+ at the time. Consequently, a VFC-specific protocol had to be written: the Lightweight XML RPC Protocol (LXRP).

4.10.2 Development

The Virtual Faraday Cage was implemented via a Python-based proof-of-concept consisting of a single third-party entity providing a movie-comparison and rating extension, as well as a social network site that worked as a wrapper around Facebook data. While the implementation is sufficient to demonstrate the primary capabilities and flexibility of the Virtual Faraday Cage, it is neither all-encompassing nor production-ready. The prototype described is not intended to be either an efficient implementation or a comprehensive framework for web application platforms. The proof-of-concept was developed entirely in Python 2.7. First, the "canonical" theoretical model was implemented (abstract data-items, privacy policies, projections, transforms, and views). Next, this model was extended into a "site-specific" model that incorporated URIs. Then, this model was extended to provide support for a local database and simulated social network. After this, third-party extension support was added to the implementation. Finally, a Facebook wrapper was implemented so that data made available to third-parties came from a "real" social network. This was done by creating a Facebook Application [4] and using Facebook's Graph API [116] to access data, and rewriting the URIs to conform with the specifications set by the Virtual Faraday Cage.

4.10.3 Proof-of-Concept

This proof-of-concept implements the core aspects of the Virtual Faraday Cage, specifically: information flow control, sandboxing, and granular control over data revelation to third-parties. Information flow control is enforced through access control built into the Virtual Faraday Cage's API, described in the following section. Sandboxing is enforced through CPython's Restricted Execution Mode, which is discussed earlier. Granular control over data revelation is managed through the construction of Views, in a method analogous to what was described in the previous chapter. To implement the Virtual Faraday Cage, an original client/server architecture for APIs was created: the Lightweight XML RPC Protocol (LXRP). The rationale and reasons for this are described in Section 4.7.2, and an overview of LXRP is provided in Section 4.7.4.
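To illustrate how these pieces fit together, the following sketch shows a hypothetical read path in which access control is consulted before a View (a composition of projections and transforms) is applied to the data that is returned. All class and method names here are assumptions for illustration, not the proof-of-concept's actual code.

    def read_request(principal, uri, datastore, policies):
        # Deny by default: an unauthorized request and a request for
        # non-existent data are indistinguishable to the caller.
        policy = policies.get((principal, uri))
        if policy is None or not policy.allows_read:
            return None
        item = datastore.get(uri)
        if item is None:
            return None
        # Granular disclosure: the end-user's chosen View determines how much
        # of the data-item this principal actually sees.
        return policy.view.apply(item)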
4.10.4 Formal Model Here, we will cover examples of working with the Virtual Faraday Cage’s implementation, specific to the abstract formal model. Datastore and Data-items In this demonstration, we will create an AbstractDatastore “ds” and populate it with some data and principals. In this example, b is created as a member of Xp . See Figure 4.18. 114 Figure 4.18: Creating a datastore and some data-items in an interactive Python environment Introducing Projections Here we demonstrate the use of projections on date and location data-items. A projection p = P0,2 is created and tested on both value types. Another projection p2 = P0,1 is created and tested on the data as well. Note that passing multiple data-items to a projection results in outputting a list of the projections applied to each data item. Compositions of both projections are demonstrated, in the first case by “unpacking” the output via the * : [v1 , v2 ] −→ v1 , v2 modifier. See Figure 4.19. Figure 4.19: Applying projections on data-items in an interactive Python environment 115 Introducing Transforms In Figure 4.20, a transform is loaded that partitions integers (assumed to be ages) into different age brackets. As transforms are specially written functions, their output and mode of operation is completely determined by the developers who implement it. Figure 4.20: Applying a transform on data-items in an interactive Python environment Projections and Transforms In Figure 4.21, two different transforms are loaded and composed with a projection. One transform partitions ages into brackets, and another parses text (assumed to be names) and produces initials. Figure 4.21: Composing projections and transforms together in an interactive Python environment 116 Invalid Projection-Transform Composition Similar to the previous example, this example also composes transforms with projections. However, this time, we demonstrate how composition can result in errors as these compositions are not necessarily symmetric. In particular, applying AgeCalculation to AgePartition makes no sense: AgePartition simply labels numbers (assumed to be ages) as either “Adult” or “Child” – and AgeCalculation expects a date vector. Similarly, applying AgeCalculation to the projection p results in an error as well. See Figure 4.22. Views In this example, a view is created as a composition of AgePartition with AgeCalculation. This view can then be applied to data items as a composition of both transforms, as shown in Figure 4.23. Privacy Policies and Access Control In this example, data is first created before creating views that obscure data by the Initials and AgePartition transforms. Then privacy policies are created and assigned to those data items, retrieved, and applied. Similarly, write policies are created and retrieved. See Figure 4.24. 4.10.5 Example Third-Party The “Electronic Movie DataBase” (EMDB) was created as a hypothetical third-party that would have movie information and global ratings, and be able to store user-specific saved ratings and favorite movies. The EMDB would then supply a VFC-based extension which would do two things: 1) store movie ratings and preferences for end-users of the extension, and 2) obtain some demographic information (“Age”, “Gender”, “Location”) from the end-user through the social network that they are using. 
To this end, the 117 Figure 4.22: Composing projections and transforms in an invalid way, in an interactive Python environment EMDB was given a minimal web presence consisting of a landing page and a VFC-capable extension URI that acted as a REST-like web application. Upon an end-user installing the EMDB extension, they would be presented with the EMDB extension’s privacy policy and terms of service, along with the requested data to be shared with EMDB. The end-user can then apply any number of applicable and supplied projections and transforms to adjust the view of their data that EMDB would be granted. 118 Figure 4.23: Creating a view in an interactive Python environment The EMDB extension would then consist of two components: a remote component that would allow for movie searches and saving of movie ratings, and a local component that would permit the comparison and sharing of ratings with other end-users (e.g., “Friends”). On EMDB’s end, whenever a movie is rated or a rating is updated, the changes are pushed back to the social network’s local extension cache – ensuring that the local extension component always has the latest data to work with. 4.10.6 Facebook Wrapper To properly test the Virtual Faraday Cage in a “real” environment, a “Facebook App” was created. To adequately simulate the Virtual Faraday Cage, a “VFC-Wrapper” was created which acted as a buffer between a Facebook user’s data and a third-party extension. The VFC-Wrapper allowed for Facebook users to have more control over both which data would be revealed to a third-party as well as the granularity of such data. Third-party extensions would then be installed from within the VFC-Wrapper, and would be interacted with through it. 119 Figure 4.24: Creating and accessing privacy and write policies in an interactive Python environment 4.11 Effects and Examples While the primary goal and motivation for developing and using the Virtual Faraday Cage has always been to facilitate better control over end-user privacy while still gaining benefits from third-party extensions within web application platforms, other benefits also may be consequences of wider adoption of the Virtual Faraday Cage. This section explores those benefits and showcases some examples of how the Virtual Faraday Cage might be used in ways differing from conventional extension APIs. 120 4.11.1 A more connected web Previously, King and Kawash [136] proposed a protocol for sharing data between different web applications on different servers with the ultimate goal of better facilitating the “meshing” of online communities. It was argued that sharing data while letting each community administer their own site completely, could be a useful form of island-bridging for all parties involved. With the already-existing practices of developing extensions and APIs for web application platforms, a more connected web may also be a consequence of wider adoption of the Virtual Faraday Cage. While APIs and extensions already exist for web application platforms, there are no widely accepted or adopted privacy-aware layers for these systems. Additionally, while privacy concerns arise in social networking platforms, they are overlooked in the larger categories of other web applications. Finally, the Virtual Faraday Cage allows for a third-party extension that can perform tasks on private end-user data “blindly” without being able to relay that information back to the third-party: no known existing proposal addresses this. 
Consequently, the Virtual Faraday Cage may be an ideal architecture for sensitive web application platforms such as finance and healthcare. Figure 4.25 shows how different categories of web application platforms may become interconnected through extensions. Each arrow represents an embedding of an extension from one platform into another, with the presumption that potentially sensitive information will largely flow in the direction of the arrow, and private information never leaves any one platform for another. 121 Figure 4.25: Graph showing the potential connectivity between categories of web application platforms based on making extensions available from one platform to another. 122 4.11.2 Examples In this section, there are examples provided showing how The Virtual Faraday Cage can be useful in addressing the particular situation. In all these examples, the economics of privacy must be taken into account: an extension provider will likely want access to some data for their own benefit as well! Example 4.1. Movie comparison extension for a social network Alice and Bob are both end-users of a particular social networking site. Alice would like to compare her movie ratings with Bob’s and obtain some meaningful interpretations from that data, but does not wish to let a third party know about her friendship or the resulting compatibility ratings. In this example, the extension’s remote component can provide access to movie titles as well as global and personal movie ratings and so on, in exchange for some limited demographic information such as Age and Gender. The extension’s local component, on the other hand, would be able to access data provided by that end-user’s friends who also use that extension. However, the local component would be unable to relay any information learned back to the third party. Consequently, both end-users are able to compare their movie ‘compatibility’, without having to worry that information such as their friendship would otherwise leak back to the third party. Specifically, let the set of all of Alice’s visible data to that extension be VE,u = {Age, Gender}, and the set of all of Alice’s private data visible to that extension’s local component be V0 E 0 ,u = {Friends, Age}. Certain data could also be made less specific, for example, an age range instead of a specific age. Depending on how inter-extension communication is handled, each user of the extension would have to authorize the sharing of their movie rating information with other users’ of the extension – such as through 123 the prompt shown in Figure 4.17. Example 4.2. Date finder Bob is an end-user on a social networking site, and he would like to find someone to date from among his friends, friends-of-friends, or other social circles. He does not mind searching for potential romantic interests on an online dating site, but he would prefer that his matches know people that he knows. Furthermore, Bob would prefer not to announce to everyone else on his social network that he is using an online dating service. Most social networking sites are not geared explicitly towards the facilitation of online dating or online personals, and it remains a possibility that specialized services can do a better job of matchmaking. In this example, Bob would like to preserve the privacy of information such as who his friends are and perhaps limit the extent of what personal data is visible to the dating service. 
To facilitate this, the dating service could request information such as his age (or age range), gender, and personal interests. It would then combine that data with additional information supplied directly to the dating service such as his sexual orientation, relationship status, and so on. A list of Bob-specific hashed friend-IDs could be stored in the local cache for Bob, allowing a local extension component to iteratively check his friends to see if their hashed friend-ID matches and that they are users of the dating service as well. At the same time, other matches from the online dating service could be presented to Bob based on the information he chose to share with the third-party. Here, the set of Bob’s visible data to the remote extension component is VE,u = {Age, Gender, Location, Interests}, and his set of visible private data to the local extension component is V0 E 0 ,u = {Friends, Friends’-Profiles}. 124 Example 4.3. Extending a map provider with a road trip planner Alice is looking to plan a road trip using a map provider of her choice and a road trip planner web service. She would like to plan out her route and stops as well as know the weather along her route, but she would like to keep specific location information and travel dates as hidden as possible from the planning service. In this example, Alice is planning her road trip route through a map provider that picks the best route for her. A road trip planner web service combines a calendar/eventplanning service along with weather information, and it provides an extension to map providers and their users. For the extension to work, it requests information about which cities are along the routes Alice plans to take; the more cities shared, the more capable the extension is at assisting planning with regards to weather. The rest of the trip planning is done through the local extension component, serving to prevent the third-party from knowing anything more than the waypoints along her route. This could be accomplished by ensuring that Alice’s set of visible data to the road trip planner is VE,u = {Cities-Along-Route} and her set of private data accessible by the planner’s local component is V0 E 0 ,u = {Travel-times}. For this to work, the local extension component E 0 would then need to receive the maximum weather information (typically around two-weeks) for all the cities along the route, and then the trip planning could be done within E 0 , with details saved to the local cache. It is also possible to distrust the map provider for storing the event planning information, however this would require that this information be kept at the road trip planner’s web service instead. This would mean that the only information that would be truly unavailable to them would be the specific addresses along the end-user’s route. Ultimately, either scenario is valid and the “better” scenario depends on which web service an end-user wants to trust for which data. 125 Example 4.4. Integrated dictionary, thesaurus, and writing helper Bob utilizes a web office suite to compose his documents, which range from essays and papers to personal letters. He would like to use an integrated dictionary/thesaurus, especially one that could catch things such as overused words or phrases, but he would not like to share his documents with any third-parties. While using the web to look up words is a relatively small inconvenience for Bob, it would be nice to have an integrated system available within his web office suite. 
Furthermore, a histogram of word frequencies, as well as alerts to common grammar mistakes or overused phrases could be useful as well – but this cannot be directly accomplished without revealing the contents of the document to a third party. By using the Virtual Faraday Cage however, it becomes possible for a dictionary/encyclopedic provider to create an extension that uses a local component to analyze a document for overused or repetitious words or phrases, while keeping word-lookups on the remote site. Consequently, only small excerpts of the document (individual words) are shared with the third-party ephemerally, and only by the explicit authorization of the end-user. In this case, Bob’s set of visible data to the remote extension component is VE,u = {Excerpts}, and his set of private data visible to the local extension component is V0 E 0 ,u = {Document}. Example 4.5. Automated bidding extension for an online auction Alice would like to install an extension to a web application that facilitates online auctions. This extension allows Alice to automate her bidding process to improve her chances of obtaining an item during an auction. However, she would like to prohibit the extension provider from learning about her shopping habits or financial information. 126 Here, an automatic bid helper could run within the protected environment as a local extension component, and utilize external third-party supplied statistics or databases to help with its decision making. It could combine the third-party supplied information with private data such as how much the end-user is willing to pay to make intelligent bids. Thus, Alice’s visible data to the third-party’s remote extension component is VE,u = NULL, and her visible data to the local extension component is V0 E 0 ,u = {Item, Finances, Other-Parties-Involved-In-Bid, ...}. Example 4.6. Augmented online shopping with trusted reviewers Bob shops for many items online, however sometimes it can be challenging for him to know which reviews of a product he can trust. Because disgruntled customers and online marketing firms can skew product reviews both ways, he would like to know which reviews to trust, or to see reviews specifically written by friends or friends-of-friends from his social network. In this example, a social network third-party could provide an extension that automatically acquires a list of hashed IDs of friends or friends-of-friends (similar to Example 3.2, thus proactively protecting the privacy of social network users) of Bob, downloading them into his local extension cache. The local extension component would then check the current items that Bob is viewing, and then emphasize reviews written by friends or friends-of-friends according to the data in the cache. This way, the social network also has no information on the types of products that Bob is viewing. Here, Bob’s visible data to the remote extension component is VE,u = NULL, and his visible data to the local extension component is V0 E 0 ,u = {Currently-Viewing}. 127 Example 4.7. Extending a healthcare site with fitness and dietary extensions Alice uses a web service to manage her medical records and keep track of her visits, and diagnoses and checkups. She also uses a diet and fitness web service to keep track of her workouts and her nutritional and dietary needs. She would like to compose these different web services in a way that preserves her privacy – as she does not need or want the fitness service knowing what her ailments are or her medical records. 
Because of the extreme sensitivity of medical data, as well as accordingly strict and detailed legislation pertaining to it – the Virtual Faraday Cage may be an ideal architecture through which medical web services can extend their capabilities through third-party extensions. In this example, the fitness/dietary third-party could provide an extension that could analyze doctor recommendations and provide appropriate fitness regimens or dietary suggestions without relaying any information about the diagnosis back to the thirdparty. This would be accomplished by storing a database of fitness information in the local extension cache. In exchange, the end-user might be asked to share some basic information back to the third-party (e.g., Gender, Weight, Height, Age) at some level of granularity (e.g., ranges). For this example, Alice’s visible data to the remote extension component is VE,u = {Gender, Weight, Height, Age}, and her visible private data to the local extension component is V0 E 0 ,u = {Medical-History, Ailments, Recommendations}. 128 4.12 Summary This chapter has introduced and covered the Virtual Faraday Cage. The Virtual Faraday Cage enables web application platforms to integrate third-party extensions to their platforms, while simultaneously enabling the complete protection of subsets of end-user data (“private data”). Furthermore, any data disseminated to third-parties can be done so at a granular, user-controlled level. The Virtual Faraday Cage also comes with privacy guarantees backed by a theoretical framework. The Virtual Faraday Cage’s API was introduced in Section 4.5, and its high-level protocol was introduced in Section 4.6. A proof-of concept for the Virtual Faraday Cage and the methodology in developing it are covered in Section 4.10. Finally, a discussion of the effects of the Virtual Faraday Cage, as well as potential examples of its application to other types of web application platforms is given in Section 4.11. The next chapter will provide an in-depth analysis of this thesis’ contributions and conclude this work. 129 Chapter 5 Analysis & Conclusion This chapter concludes the discussion of the Virtual Faraday Cage, covering comparisons to existing work, as well as shortcomings and criticisms of the approaches in this thesis. This section also discusses future work. 5.1 Comparisons and Contrast 5.1.1 PIPEDA Compliance Section 1.4.2 introduces the Personal Information Protection and Electronic Documents Act (PIPEDA) [21], a Canadian privacy law that dictates how organizations can collect, use, and disclose personal information. PIPEDA also establishes ten principles that an organization must uphold: 1) Accountability, 2) Identifying Purposes, 3) Consent, 4) Limiting Collection, 5) Limiting Use, Disclosure, and Retention, 6) Accuracy, 7) Safeguards, 8) Openness, 9) Individual Access, and 10) Challenging Compliance. The Virtual Faraday Cage supports these principles: Principle 1 - Accountability (PIPEDA Section 4.1) “Organizations shall implement policies and practices to give effect to the principles, including (a) implementing procedures to protect personal information [...]” (PIPEDA Section 4.1.4) 130 The Virtual Faraday Cage supports this by providing a fine-grained and strict access control mechanism that prevents unauthorized data dissemination. 
Principle 2 - Identifying Purpose (PIPEDA Section 4.2) “The organization shall document the purposes for which personal information is collected in order to comply with the Openness principle (Clause 4.8) and the Individual Access principle (Clause 4.9)” (PIPEDA Section 4.2.1) The Virtual Faraday Cage’s model does not enforce purpose, but purposes are collected from third-parties when they request access to end-user data within a given web application platform. “The identified purposes should be specified at or before the time of collection to the individual from whom the personal information is collected. Depending upon the way in which the information is collected, this can be done orally or in writing. An application form, for example, may give notice of the purposes.” (PIPEDA Section 4.2.3) Purposes are collected and specified when a third-party requests accesses to an enduser’s data. Principle 3 - Consent (PIPEDA Section 4.3) “Consent is required for the collection of personal information and the subsequent use or disclosure of this information. Typically, an organization will seek consent for the use or disclosure of the information at the time of collection. In certain circumstances, consent with respect to use or disclosure may be sought after the information has been collected but before use (for example, when an organization wants to use information for a purpose not previously identified).” (PIPEDA Section 4.3.1) The consent of end-users is required before any data is shared with third-parties, and end-users dictate whether or not this data is shared once, always, or if they must be asked per-use. 131 “The principle requires ‘knowledge and consent’. Organizations shall make a reasonable effort to ensure that the individual is advised of the purposes for which the information will be used. To make the consent meaningful, the purposes must be stated in such a manner that the individual can reasonably understand how the information will be used or disclosed.” (PIPEDA Section 4.3.2) When consent is requested from end-users, the purposes are provided at that moment of time. “An organization shall not, as a condition of the supply of a product or service, require an individual to consent to the collection, use, or disclosure of information beyond that required to fulfill the explicitly specified, and legitimate purposes.” (PIPEDA Section 4.3.3) The Virtual Faraday Cage specifically facilitates the ability for end-users to choose a granularity “view” of their data such that their data is not revealed at all to third-parties. Consequently, the Virtual Faraday Cage architecture implicitly supports this. “Individuals can give consent in many ways. For example: (a) an application form may be used to seek consent, collect information, and inform the individual of the use that will be made of the information. By completing and signing the form, the individual is giving consent to the collection and the specified uses; (b) a checkoff box may be used to allow individuals to request that their names and addresses not be given to other organizations. Individuals who do not check the box are assumed to consent to the transfer of this information to third-parties; (c) consent may be given orally when information is collected over the telephone; or (d) consent may be given at the time that individuals use a product or service” (PIPEDA Section 4.3.7) Consent is done on a per-data-item basis, and can manifest in in multiple displayed forms, all of which can comply with (a) and/or (b), and (d). 
132 “An individual may withdraw consent at any time, subject to legal or contractual restrictions and reasonable notice. The organization shall inform the individual of the implications of such withdrawal.” (PIPEDA Section 4.3.8) The Virtual Faraday Cage can notify third-parties that an end-user has withdrawn their consent for the use of an extension. Principle 4 - Limiting Collection (PIPEDA Section 4.4) “Organizations shall not collect personal information indiscriminately. Both the amount and the type of information collected shall be limited to that which is necessary to fulfill the purposes identified. Organizations shall specify the type of information collected as part of their information-handling policies and practices, in accordance with the Openness principle.” (PIPEDA Section 4.4.1) The Virtual Faraday Cage supports this by forcing third-parties to specify specifically what data is being collected and how it is revealed. The Virtual Faraday Cage also supports end-users in determining if revealing that data to a third-party extension is appropriate for the specified purposes. Principle 5 - Limiting Use, Disclosure, and Retention (PIPEDA Section 4.5) The Virtual Faraday Cage does not specifically address or aid in upholding this principle. Principle 6 - Accuracy (PIPEDA Section 4.6) “Personal information that is used on an ongoing basis, including information that is disclosed to third parties, should generally be accurate and up-to-date, unless limits to the requirement for accuracy are clearly set out.” (PIPEDA Section 4.6.3) The Virtual Faraday Cage allows third-parties to maintain current information on end-users, as permitted by end-users. In particular, end-users decide if, when, and how often a third-party may request the “latest” information. Consequently, the accuracy principle is facilitated through the Virtual Faraday Cage, as permitted by end-users. 133 Principle 7 - Safeguards (PIPEDA Section 4.7) “The security safeguards shall protect personal information against loss or theft, as well as unauthorized access, disclosure, copying, use, or modification. Organizations shall protect personal information regardless of the format in which it is held.” (PIPEDA Section 4.7.1) The Virtual Faraday Cage employs an access control system that protects against unauthorized actions on an end-user’s sensitive data – no such data can be read or written unless explicitly authorized by the end-user. Principle 8 - Openness (PIPEDA Section 4.8) The Virtual Faraday Cage does not specifically address or aid in upholding this principle. Principle 9 - Individual Access (PIPEDA Section 4.9) “Upon request, an organization shall inform an individual whether or not the organization holds personal information about the individual. Organizations are encouraged to indicate the source of this information. The organization shall allow the individual access to this information. However, the organization may choose to make sensitive medical information available through a medical practitioner. In addition, the organization shall provide an account of the use that has been made or is being made of this information and an account of the third parties to which it has been disclosed.” (PIPEDA Section 4.9.1) Within the Virtual Faraday Cage, an end-user can “see” all of their sensitive data, as part of the requirements for ensuring that end-users can set appropriate access control policies on them. 
Consequently, end-users can, at all times, know what sensitive data the web application platform has, as well as what sensitive data has been revealed to third-parties. 134 “In providing an account of third parties to which it has disclosed personal information about an individual, an organization should attempt to be as specific as possible. When it is not possible to provide a list of the organizations to which it has actually disclosed information about an individual, the organization shall provide a list of organizations to which it may have disclosed information about the individual.” (PIPEDA Section 4.9.3) As stated under PIPEDA Section 4.9.1, an end-user can see which third-parties have had access to sensitive data. “An organization shall respond to an individual’s request within a reasonable time and at minimal or no cost to the individual. The requested information shall be provided or made available in a form that is generally understandable. For example, if the organization uses abbreviations or codes to record information, an explanation shall be provided.” (PIPEDA Section 4.9.4) With the Virtual Faraday Cage, it is technologically feasible to make the list of sensitive data immediately accessible to any end-user through a privacy settings portal. “When an individual successfully demonstrates the inaccuracy or incompleteness of personal information, the organization shall amend the information as required. Depending upon the nature of the information challenged, amendment involves the correction, deletion, or addition of information. Where appropriate, the amended information shall be transmitted to third parties having access to the information in question.” (PIPEDA Section 4.9.5) While not forced, third-parties with access to a specific data-item can subscribe to changes in that data – should an end-user alter that data’s value, the third-parties will be notified in real-time. 135 Principle 10 - Challenging Compliance The Virtual Faraday Cage does not specifically address or aid in upholding this principle. While the Virtual Faraday Cage is not a complete solution for PIPEDA compliance, it does assist in allowing for a web application platform to better maintain such compliance despite allowing third-parties access to end-user data. Specific to the Canadian Privacy Commissioner’s findings on Facebook [22], the Virtual Faraday Cage can be used as a system to help alleviate concerns over third-party access to end-user data. In particular, in the findings, the Commissioner wrote that the original CIPPIC complaints regarding Facebook and third-party “applications” (extensions) were that Facebook: 1. “was not informing users of the purpose for disclosing personal information to third-party application developers, in contravention of Principles 4.2.2 and 4.2.5;” 2. “was providing third-party application developers with access to personal information beyond what was necessary for the purposes of the application, in contravention of Principle 4.4.1;” 3. “was requiring users to consent to the disclosure of personal information beyond what was necessary to run an application, in contravention of Principle 4.3.3;” 4. “was not notifying users of the implications of withdrawing consent to sharing personal information with third-party application developers, in contravention of Principle 4.3.8;” 5. “was allowing third-party application developers to retain a users personal information after the user deleted the application, in contravention of Principle 4.5.3;” 136 6. 
“was allowing third-party developers access to the personal information of users when their friends or fellow network members added applications without adequate notice, in contravention of Principle 4.3.2;” 7. “was not adequately safeguarding personal information in that it was not monitoring the quality or legitimacy of third-party applications or taking adequate steps against inherent vulnerabilities in many programs on the Facebook Platform, in contravention of Principle 4.7;” 8. “was not effectively notifying users of the extent of personal information that is disclosed to third-party application developers and was providing users with misleading and unclear information about sharing with third-party application developers, in contravention of Principles 4.3 and 4.8;” 9. “was not taking responsibility for the personal information transferred to third-party developers for processing, in contravention of Principle 4.1.3; and” 10. “was not permitting users to opt out of sharing their name, networks, and friend lists when their friends add applications, in contravention of Principle 4.3 and subsection 5(3).” The Virtual Faraday Cage helps address 1, 3, 6, 8, and 10 of the CIPPIC’s allegations – and can help alleviate allegations 2, 4, and 7. In the Privacy Commissioner’s report, Facebook was found to be in violation of Principles 2 and 3 (“I am concerned that users are not informed of what personal information developers are accessing and are not adequately informed of the purposes for which their personal information is to be used or disclosed.”), as well as Principle 7 (“[...] given the vast potential for unauthorized access, use, and disclosure in such circumstances, I am not satisfied that contractual arrangements in themselves with the developers constitute adequate safeguards for the users’ personal information in the Facebook context.”). One of the privacy challenges Facebook faces is that it has essentially pushed the responsibility of ensuring privacy protection onto the third-party; at the same time, however, that third-party cannot be held to the same level of trust and adherence to policies as Facebook itself. While Facebook has attempted to mitigate this by providing a “Facebook Verified” badge for third-party extensions that pay a fee and can explain the data that they collect, this does not address either the revelation of “basic information” or the reuse of data for purposes other than those stated, and the audit is optional. Furthermore, when addressing the issue of third-parties having access to data about end-users who did not install their extensions (e.g., friends of end-users who did), Facebook’s solution was to have anyone concerned about this opt out of using any third-party extensions. The Privacy Commissioner stated that Facebook should develop a means by which to monitor third-party extensions so as to ensure that third-parties are complying with consent requirements – and that Facebook should consider providing third-party developers a template they can use to explain their data needs and purposes. Another issue was that third-parties did not need to obtain consent from other users on a network (or on an end-user’s friend list) when a third-party extension was installed. The Privacy Commissioner also found Facebook’s use of contractual agreements to be insufficient with regard to Principle 7 – especially as the principle explicitly states the need for technological measures.
As stated in Section 1.4.2, despite the changes Facebook made to address the original PIPEDA complaint filed against it, CIPPIC filed another complaint [24] in 2010 which expressed their dissatisfaction with Facebook’s response and indicated that they felt that many of the core concerns they had were not addressed by these changes, including the lack of support for fine-grained and granular control over end-user data when shared with third-parties. In this context, the Virtual Faraday Cage may be a prime framework for addressing some of those complaints. The Virtual Faraday Cage helps web application platforms like Facebook provide end-users with third-party extensions, while simultaneously protecting end-users from unintended or unauthorized data leakage and helping reduce the necessary trust placed in these third-parties. By using the Virtual Faraday Cage, platforms like Facebook can reduce the need to rely on third-parties policing themselves, as well as the need for an “all-in” or “none-at-all” approach to using third-party extensions and sharing data with them. For instance, friend list contents would be revealed as opaque IDs unless consent was obtained from the individual end-users on that list. The Virtual Faraday Cage thus explicitly helps address the Privacy Commissioner’s recommendations regarding “Third-Party Applications.” 5.1.2 Comparisons with Other Works Section 1.5.2 presents Bonneau et al.’s work [66]. They examined how private or sensitive data could be obtained from an online social network without end-users’ knowledge, and demonstrated how this could be accomplished through interacting with Facebook’s API using the Facebook Query Language (FQL). As FQL queries return the number of records in a query, this can be exploited to leak some information about end-users within Facebook. Consequently, as a precaution, the Virtual Faraday Cage API specifically returns no information in cases where the principal does not have access to the data being queried. For instance, a read-request or write-request that requires end-user permission per request is designed to limit the opportunity for third-parties to know if and when the user is online. Similarly, existence tests are prohibited because the result of performing an API method on non-existent data is the same as performing an API method on data that one does not have permission to access. Section 2.2.1 introduced Rezgui et al.’s [96] and Dekker et al.’s [97] works. Rezgui et al. identified and defined many aspects of privacy within Web Services, and we now compare their work with the Virtual Faraday Cage. Dekker et al. propose using formal languages to write licenses for data dissemination; however, the enforcement of privacy would need to be accomplished through legal means. Rezgui et al. define ‘user privacy’ as a user’s ‘privacy profile’. This profile consists of the user’s privacy preferences for their data, specific to a particular ‘information receiver’ and purpose. They define ‘service privacy’ as a comprehensive privacy policy that specifies a Web Service’s usage, storage, and disclosure policies. Finally, ‘data privacy’ is defined as the ability for data to “expose different views to different Web Services”, similar to a notion of an adjustable granularity or that of projections and transforms in the Virtual Faraday Cage.
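To illustrate the parallel, the following sketch shows how a view in the spirit of the Virtual Faraday Cage could compose a projection with a transform to expose a coarser-grained version of a data item to a third-party. The function and field names are hypothetical illustrations, not the thesis’ actual interfaces.

# Hypothetical sketch of a view as a chain of projections and transforms
# applied to a single data item before it is exposed to a third-party.

def project(record, dimensions):
    """Keep only the named dimensions (attributes) of a data item."""
    return {k: v for k, v in record.items() if k in dimensions}

def coarsen_location(record):
    """Example transform: reveal the city but strip exact coordinates."""
    out = dict(record)
    out.pop("lat", None)
    out.pop("lon", None)
    return out

def apply_view(record, chain):
    """A view is an ordered chain of projections/transforms applied in sequence."""
    for step in chain:
        record = step(record)
    return record

profile = {"name": "Alice", "city": "Calgary", "lat": 51.05, "lon": -114.07}
view = [lambda r: project(r, ["name", "city", "lat", "lon"]), coarsen_location]
print(apply_view(profile, view))  # {'name': 'Alice', 'city': 'Calgary'}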
While Dekker et al.’s [97] approach may yield numerous advantages over an end-user agreeing to an in-English privacy policy (or a P3P policy), such a system is still addressing a fundamentally different aspect of privacy than the Virtual Faraday Cage. In their approach, the authors attempt to examine a way for legally-binding license agreements for end-user data to be processed and analyzed by machines for the purpose of enforcement checking and forming derivative licenses. While a well-behaved entity could abide by a license, a malicious one would have to be taken to court – a solution this thesis does not consider ideal, as no legal remedy would be capable of restoring that end-user’s privacy once it was lost. The Virtual Faraday Cage attempts to avoid this scenario altogether by preventing [private] data dissemination in the first place. Section 2.3.1 presents Guha et al.’s [107] “None Of Your Business” (NOYB). However, one of the limitations of NOYB is that it does not account for how communication between end-users (where one is not using NOYB) can be achieved. Furthermore, it does not account for how end-user data can be shared legitimately with third-parties (or even the application provider) for legitimate and useful features and capabilities. This is in fact not a focus of NOYB. This is in contrast with the Virtual Faraday Cage, which seeks to facilitate meaningful [sensitive] data-sharing between end-users and third-parties, as well as meaningful use of private data by third-party extensions that are unable to leak that data back to the third-parties. Additionally, while NOYB and similar projects may be able to bypass detection by web application providers, the use of NOYB may still constitute a violation of the platform’s terms of service – as NOYB essentially piggybacks on Facebook to provide its own social network experience. Ultimately, NOYB provides something different than what the Virtual Faraday Cage provides. Baden et al. [109] propose Persona, a new architecture for privacy-preserving online social networks. Persona uses a distributed data-storage model, and requires end-users to have browser extensions that can encrypt and decrypt page contents. Their idea is to provide for finer-grained access control with the stipulation that access to the decrypted data is not possible without permission granted from the end-user in the first place. While their model provides significant security guarantees, it requires special browser extensions for all end-users, and does not incorporate a system by which untrusted third-party extensions can work with end-user data without being capable of revealing that information back to the third-parties. Section 2.3.2 describes Felt and Evans’ [65] proposal for a privacy-by-proxy API for Facebook and other social networks. However, their proposal prevents all data disclosure through the web application platform, on the basis that most Facebook apps (extensions) do not need to access such data in the first place. Instead, they argue that data disclosure should happen separately, outside of the platform, and directly to the third-party. However, this would do nothing to address granular data disclosure, or the ability to hide data from the third-party while still allowing the third-party to work with that data. This thesis argues that catering to the ‘lowest denominator’ [of third-party extensions] is inappropriate.
Just because many current extensions within Facebook are “junk” does not mean that web application platforms should prohibit more meaningful extensions from functioning by hard-coding the prevention of end-user data usage. Ultimately, combining aspects of Felt and Evans’ work with the Virtual Faraday Cage might present the best benefits in practice. As these two approaches are not mutually exclusive, doing so is straightforward: all extensions have the capability to use personalization without requiring end-user data access, but extensions that need such access will have to go through the Virtual Faraday Cage. A proposal for how this can be done is presented in Section 4.6.3. Felt et al.’s [112] study of Google Chrome extensions proposed a permission ‘danger’ hierarchy, where more ‘dangerous’ permissions are emphasized over less ‘dangerous’ ones (see Section 2.4). However, this observation is not obviously applicable to the Virtual Faraday Cage, as different types of data stored on different types of web application platforms may have similar URIs but totally different contents and privacy risks associated with revealing them. On the other hand, because the Virtual Faraday Cage supports both install-time and run-time permission requests – and because it is designed to be built into a parent platform – it is possible to imagine how a review or vetting process could be added to further reduce the presence of malicious extensions that end-users might install despite warnings generated by permission and capability requests. However, at the same time, because the dangerous components of extensions are effectively limited to the remote components, a review and vetting process may be of limited use within the Virtual Faraday Cage. Fredrikson and Livshits [113] argue that the browser should be the data-miner of personal information, and that the browser could contain third-party miner ‘extensions’ within it. In other words, all data operations must go through the browser, which is considered trusted. Similarly, the Virtual Faraday Cage requires that all operations on private data (that must not be relayed back to the third-party) be performed within the web application platform and not performed remotely. 5.2 Time & Space Complexity As the Virtual Faraday Cage introduces several new architectural components – adding new metadata as well as new workflows for accomplishing tasks within a web application platform – it is important to consider the potential impacts of the Virtual Faraday Cage on performance and data storage. This section covers the time and space complexity analysis for the Virtual Faraday Cage, including the complexity costs associated with components such as hashed and opaque IDs, granular views, sandboxing, and message protocols. 5.2.1 Hashed IDs Hashed IDs, as implemented by means of a cryptographic hash function, would require O(|D| + |P|) storage, where D represents all the sensitive data (for all principals) within a system, and P represents the set of principals. Using SHA-256, this would require an additional 256 bits of storage per object. As hashed IDs would be stored along with the records of actual objects, the additional computational complexity for looking up the true object should be constant-time. While the use of a hash function may incur potential collisions, with a suitably chosen hash function this should not occur in practice.
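As a rough illustration of this bookkeeping (using a hypothetical record layout, not the proof-of-concept’s actual schema), a 256-bit hashed ID can be derived with SHA-256 and stored alongside the record so that the true object is resolvable in constant time:

import hashlib

def hashed_id(real_id):
    # 256-bit digest (64 hex characters) stored alongside the object record
    return hashlib.sha256(real_id.encode("utf-8")).hexdigest()

records = {}  # toy "table" keyed by hashed ID for constant-time resolution

def store(real_id, payload):
    hid = hashed_id(real_id)
    records[hid] = {"real_id": real_id, "payload": payload}
    return hid

hid = store("user:42/photo:7", {"caption": "Banff"})
print(len(bytes.fromhex(hid)) * 8)   # 256 extra bits per object
print(records[hid]["payload"])       # constant-time lookup by hashed ID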
An alternative to using a hash function would be to encrypt the real IDs of objects using a fast symmetric cipher (e.g., Rijndael/AES), which would also eliminate the concern of collisions. 5.2.2 Opaque IDs If opaque IDs are implemented by means of a cryptographic hash function, then a lookup table is needed to perform a reverse-lookup from an opaque ID to a real ID. This would require storage of up to O((|D| + |P|)²), i.e., O(|D|² + |D| · |P| + |P|²), where P represents the set of all principals. As the number of principals and/or sensitive data grows large, the storage space that must be allocated grows to unmaintainable levels. On the other hand, if opaque IDs are implemented by means of symmetric-key encryption, where each user has a different opaque ID “key”, then storage space can be reduced to simply O(|P|), or on the order of the number of entities in a given system. Using a system like AES, this would require adding an extra 256 bits to each entity record in a database and is far more realistic. Because opaque IDs would reveal the true ID of a given object when decrypted, and because they have a fixed size for a given web application platform, the additional computational costs of using opaque IDs would be constant. 5.2.3 Views Estimating the time and space complexity of views is more challenging than for other components of the Virtual Faraday Cage. Without fixing the maximum dimension of projections, or prohibiting the repetition of a dimension in a given projection, it becomes impossible to fix an upper bound for storage costs. However, if we assume a fixed constant d to represent the maximum dimension of any projection, and we assume that the maximum dimension of any data item that may be put into a view will be a fixed constant d′ (where d′ ≥ d), then the upper bound for storage of a given projection will be d · log₂(d′) bits, or O(1). If the total number of transforms that a given web application platform supports is bounded by a fixed constant t, then the storage space that a single transform will occupy will be log₂(t) bits or, again, O(1). Finally, if the total number of projections and transforms that can be chained together is fixed as c, then the total storage space of a given view is bounded by max(d · log₂(d′) · c, log₂(t) · c) bits, which again reduces to O(1). For example, if d is 32, d′ is 2³², t is 2⁸, and the maximum chain length c is generously 32, then the maximum storage size for any given view will be 4 kilobytes. By further constraining the maximum dimension of data (e.g., d′ = 16), the maximum storage size for a given view will be at most 1 kilobyte. The computational time needed to apply a given view on a particular data-item will vary depending on the computational time of given transforms, and on the particular data-item size s. For projections, it should be possible to implement a solution requiring O(s) operations. If the computational time needed to display or send a particular data-item from the platform to an end-user or third-party is considered to be an O(s) operation as well, then barring computationally-intensive transforms (beyond O(s)), the additional computational resources should be negligible. Consequently, the total computational time for a view should be bounded by O(cs), which reduces to O(s) because c is a constant. Similarly, for storage of a view, it can be presumed that because a view will never add additional information, the storage space for the view’s result should never be greater than s.
Consequently, the total storage complexity of a web application platform supporting views is O(s). As views can be ephemeral, this space can be further reduced in practice as once a view is generated and distributed, the space it occupied can be freed for future operations. 5.2.4 Access Control Storing the read/write access control rules for end-users will require at most O((3|D| + |D|) · |P|), which reduces to O(|D| · |P|) storage space. The maximum number of entries for read views for a single data item is limited to a constant number (at most the number of access types – in our case, 3), and the lookup table for views would comprise a column for principal IDs, a column for data item IDs, a column for the requesting principal’s ID, a column for the access type (which has three types), and a column for the corresponding view. Because each access control rule can apply to different third-parties, the size will ultimately scale with the number of third-parties multiplied by the amount of data. On average, the actual storage size may be much lower (e.g., the average end-user may never exceed granting access to more than 100 third-parties), but the worst case is still very large. Using the example numbers for a view’s size (1 kilobyte), assuming that IDs occupy 64 bits of space, and allocating 4 bits for the view type, the read-rules would be just slightly over the view size (e.g., 1049 bytes). For write rules, the maximum number of entries is exactly |D| · |P|, consequently the storage complexity is O(|D| · |P|), and the individual entry size would be much lower (e.g., 25 bytes). As for the computational time needed to query the access control rules and apply them, this would depend on the particular implementation and performance of database queries. Assuming that a composite primary key index comprised of the data owner principal ID, data item IDs, a requesting principal ID, and the access type could be created, and assuming lookups based on primary keys could be completed within O(|D| log |D|) time, then the total time complexity would be the same. This is because the additional steps required (looking up hashed IDs and opaque IDs) are constant time, and resolving a data item’s URI would ideally just be another O(|D| log |D|) operation. 5.2.5 Subscriptions Storing subscription information for data items will require at most O(|D| · |P|) storage space. Individual records in such a table would comprise a data ID and a subscriber ID, which may reduce to 64 bytes per record. As with the access control records, in practice the average data item may have few subscribers, reducing the average-case storage complexity to something that scales linearly with the amount of data in the system. Like access control, the time complexity of accessing the subscriber list would depend on the implementation and performance of database queries, and dispatching messages to third-party subscribers would also incur a computational cost – ideally bounded by the average case. 5.2.6 Sandboxing Estimating the time and space complexity for sandboxing is a challenging task. First, no time or space constraints were specified when sandboxing was described, but without assuming constraints, there is no way to provide estimates. Consequently, let us assume that for a particular platform, the limits on the storage space allocated for third-party extension local components are fixed, as are the computational resources that such a component can utilize. These limits may be expressed as both “global” and “per-user” limits – either way, we may be able to assume that the maximum storage space for a single extension’s local component is a fixed size. Borrowing from the W3C recommendations for HTML5 Web Storage [137], a limit of 5,242,880 bytes of storage per domain origin may be a reasonable starting point, but this would cause problems for multiple extensions hosted on the same primary domain. There may also be a variety of ways that storage and computational resources may be made available to third-parties; for instance, are long-lived processes allowed? Can additional storage space be “purchased” by a third-party? Does the amount of storage space scale with the number of end-users utilizing that third-party’s extension? Alternatively, is the third-party charged for every user that utilizes their extension? As a result of these questions, the storage space required for sandboxing may scale with the number of third-party extensions (that have local components), or it may scale with the number of users of an extension (with a local component), or both – across all such extensions. Similarly, computational requirements are also difficult to examine: even if we assume that local components must execute within a fixed time t or otherwise be terminated – resulting in O(|E|) (where E represents the set of all extensions) maximum computational resources – such estimates are both incomplete and impractical, as they do not take into account the amount of resources likely to be used at any given point in time, and similarly fail to take into account the fact that end-users expect near-instantaneous results when interacting with a web application. 5.2.7 Protocol There are two protocols presented within the Virtual Faraday Cage – the “High-Level Protocol” (VFC-HLP), and the protocol used to execute remote procedure calls, which is LXRP. The time complexity of executing VFC-HLP’s operations (authorizing an extension, and mutual authentication) is constant: a fixed number of messages is passed to authorize an extension or ascertain that a party is who they claim to be. Similarly, for LXRP, the time complexity for executing any of its operations is also fixed: initial authentication is a fixed two-message process, querying for the available methods is also a fixed two-message process, and method calls also result in a fixed two-message, or at most four-message (e.g., with a callback), process. As for the space complexity, the size of communications depends strictly on the amount of data being passed. Platform-wide, the total number of messages passed will depend on the total number of method calls between third-parties and the platform, which is difficult to estimate. 5.2.8 Summary In general, apart from sandboxing and the protocol it uses, the additional overhead generated by the Virtual Faraday Cage scales linearly with the number of principals and/or data items within the system. With reasonable constraints (e.g., limits on dimensionality, view chain length, etc.), the additional load due to the Virtual Faraday Cage will likely be manageable: new operations are completed in constant time, and new data storage scales linearly with the total amount of data stored.
In the context of social networking platforms, where individual object content can be very large (e.g., photographs and videos), the amount of additional storage overhead needed to incorporate access control rules, hashed IDs, and opaque IDs will be minimal in comparison to the actual data item size. As the additional storage needed is bounded by a fixed constant, this size difference would be linear in practice. 5.3 Shortcomings 5.3.1 Personal Information Protection and Electronic Documents Act While the Virtual Faraday Cage does help support many of PIPEDA’s principles, it does fall short in a few key areas. For example, no support for the changing of purposes is provided inherently by the Virtual Faraday Cage, something that PIPEDA’s Principles 2 and 5 request support for. PIPEDA’s eighth principle, “Openness”, is partially supported through the Virtual Faraday Cage – users can see what data the web application platform has on them; however, no support is provided for granting end-users a view of what data third-parties have on them. The Virtual Faraday Cage also does not assist in Principle 10, “Challenging Compliance”. Consequently, the Virtual Faraday Cage is not a panacea for addressing PIPEDA compliance, but instead may be viewed as one component of a multi-faceted approach to achieving compliance. 5.3.2 Inter-extension Communication The Virtual Faraday Cage proposes that inter-extension communication should be handled by prompting the end-user to decide whether or not to permit it. This would be accomplished by requiring the end-user to decide whether or not to share the data with other users of the same extension. While the Virtual Faraday Cage can ensure that any data revealed through this mechanism remains exclusively in the private data sets within the web application platform (assuming that data declassification is not permitted), there is no obvious way to prohibit information leakage to other end-users unless the data owner can verify that the contents of the data being shared are not private. However, this approach is vulnerable to a different problem: end-users may simply accept sharing whatever data they are presented with. This may result in a phenomenon similar to “click fatigue”, where a user may no longer spend the necessary time to read a security dialog before choosing their answer [138]. Additionally, as the amount of data grows, a ‘raw’ view into the data to be shared may become impractical. While a user can easily determine whether or not they would like to “share their movie ratings with their friends”, it is less clear whether or not they would like to share a long list of raw data (e.g., Figure 4.12). Consequently, in the long-term, an alternative approach to obtaining user consent is likely necessary. One such approach is discussed in Section 5.3.3. 5.3.3 Proof-of-Concept As the implementation of the Virtual Faraday Cage was intended to be a proof-of-concept and not a “production-ready” framework for deploying the Virtual Faraday Cage, there exist many areas where the implementation could be improved: • Efficiency - The current implementation does not take into account efficiency: many objects are sub-classed multiple times, special vector objects are used to represent data, and the storage, retrieval, and execution of privacy policies and their corresponding views can be made both more memory- and computationally-efficient.
Additionally, communication efficiency can be improved greatly by switching from a raw XML-based protocol to something that uses compression and/or binary data, such as EXI (Efficient XML Interchange) [139]. • Sandboxing - Sandboxed local extensions need to be confined to a set amount of memory, limited in their CPU consumption, and terminated if their execution time exceeds certain margins. This should be configured by the web application platform, potentially on a per-extension basis. Additionally, the execution of sandboxed code needs to remain fast even with these additional changes. This may mean that a production-ready architecture would necessitate sandboxed extensions running on a separate dedicated server for the sake of performance. • Third-Parties - Currently, third-party extension identities are bound to their URIs. If a third-party were to change the URI of their extension, they would lose all access rights to resources on a web application platform. This can be rectified either through a per-application mechanism for changing third-party extension IDs, or by having extension IDs be something other than directly derived from (and fixed to) the third-party extension URLs. • Callbacks - Callback delays are not implemented; however, in practice, some level of delay may be preferable. This may need to be set at the end-user level, so that for certain applications, artificial delays are not forced (e.g., for a real-time mapping extension). Alternatively, end-users might be able to set a flag or permission that allows extensions to receive immediate responses from them, at the risk of revealing when they are online. • Seamless Remote Procedure Calls - Currently, only methods can be exposed through the LXRP system. Future work could allow the direct exposure of objects to clients rather than manually specifying methods; however, this would largely be a stylistic improvement rather than a functional one. • Selective API Revelation - LXRP’s architecture does not allow for selective function revelation to clients depending on their credentials; however, this should be implemented in the future. 5.3.4 Hash Functions Throughout this thesis, the Virtual Faraday Cage makes extensive use of the SHA-256 hash function. While the SHA-256, SHA-384, and SHA-512 family of hash functions has wide adoption and has been studied extensively for security vulnerabilities, there is no guarantee that vulnerabilities will not be discovered in the future. However, implementations of the Virtual Faraday Cage can utilize any strong cryptographic hash function as better methods are discovered. 5.3.5 High-Level Protocol The Virtual Faraday Cage’s High-Level Protocol (VFC-HLP) can verify that an incoming connection is coming from the URL it claims to come from; however, is this sufficient? Will users be able to differentiate between two similar-looking extension URLs? And what if someone were able to put a malicious extension on the same host or top-level domain? While preventing phishing attacks is not one of the Virtual Faraday Cage’s design goals, it is important to acknowledge that such attacks may become an eventuality. One potential solution to mitigate such attacks is to mandate a fixed format for all extension URLs, for example “https://extension.yourdomain.com/”. However, this approach suffers from inflexibility, and only addresses a very specific set of impersonation attacks (namely, attacks from the same top-level domain).
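As a sketch of that mitigation (the exact pattern here is an illustration, not a format the thesis mandates), a platform could refuse to register any extension whose URL does not match the fixed format:

from urllib.parse import urlparse

def conforms_to_mandated_format(url):
    """Accept only URLs of the form https://extension.<registered-domain>/"""
    parts = urlparse(url)
    return (parts.scheme == "https"
            and parts.hostname is not None
            and parts.hostname.startswith("extension.")
            and parts.path in ("", "/"))

print(conforms_to_mandated_format("https://extension.example.com/"))  # True
print(conforms_to_mandated_format("https://ext.example.com/app"))     # False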
Apart from informing the user clearly about the third-party when they attempt to grant data-access to a third-party extension, there is no obvious remedy to this problem. Improvements to the user-interface may allow users to clearly check whether a given URL for an extension is valid, and similarly, the display of SSL certificate information may give users more confidence in those extensions. Despite this, even if end-users accidentally allow a malicious or impersonating extension access to their data, their private data remains private and cannot leave the web application platform. This remains one of the strongest benefits of the Virtual Faraday Cage’s unique architecture. 5.3.6 Time & Space Complexity Analysis While some aspects of the time and space complexity for the Virtual Faraday Cage have been discussed, and seem promising, a full and in-depth analysis of the additional overhead required by the Virtual Faraday Cage has not been conducted. Additionally, the analysis of the time and space complexity for sandboxing is incomplete, as is the complexity of the Virtual Faraday Cage’s protocol when considered platform-wide. 5.4 Future Work 5.4.1 Purpose While Purpose is considered unenforceable by the Virtual Faraday Cage, allowing third-parties to claim a purpose can still serve a useful role. Currently, third-parties must supply data-specific purposes for the initial data that they request (if any) from end-users when end-users first grant these third-party extensions access to their data. The Virtual Faraday Cage can easily be extended to allow for purpose specification when a third-party attempts to perform any requested operations, allowing an end-user to better gauge whether or not to allow an operation. While third-parties could easily lie about their purposes, having the ability to both log and track data sharing for specified purposes may both improve the end-user and third-party developer experience and provide some form of accountability and tracking when data is shared with third-parties. As mentioned in Section 5.3.1, the Virtual Faraday Cage should support purpose to a greater extent, for example, by allowing a third-party to re-specify a new purpose for the use of data. 5.4.2 Enhanced Support for Legal Compliance As indicated in Section 5.3.1, the Virtual Faraday Cage falls short as a complete system for PIPEDA compliance. Additionally, apart from PIPEDA, there are several other important privacy laws that exist worldwide (see Section 1.4.2). Future work with the Virtual Faraday Cage should seek to further address the issue of privacy law compliance, and to enhance the Virtual Faraday Cage as a tool for such compliance. 5.4.3 Callbacks For many API function calls, a callback ID can be passed as an optional parameter so that the Virtual Faraday Cage can send a potentially delayed response to the third-party extension. This provides two benefits: 1) third-party extensions do not have to wait for processing to occur on the back-end, or wait for end-user input, and 2) callbacks can be delayed or dropped, making it unclear to the third-party when the end-user was online and denied a request. In the latter case, this helps prevent leakage of information regarding when a user was online or not.
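A minimal sketch of the platform-side handling (the names and the delay policy are hypothetical, not the proof-of-concept’s implementation): a response tied to a callback ID is queued with a randomized delay, or silently dropped on denial, so that the delivery timing does not reveal when the end-user was online.

import random
import threading

def send_to_extension(callback_id, payload):
    print("callback", callback_id, "delivered:", payload)  # stand-in for an HTTPS POST

def complete_callback(callback_id, approved, payload, max_delay_s=300.0):
    if not approved:
        return  # dropped: a denial is indistinguishable from the user being offline
    delay = random.uniform(0.0, max_delay_s)  # decouple delivery time from user activity
    threading.Timer(delay, send_to_extension, args=(callback_id, payload)).start()

complete_callback("cb-123", approved=True, payload={"city": "Calgary"}, max_delay_s=2.0)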
For example, in the context of a third-party extension asking for geo-locational data, if an end-user had authorized that third-party extension to always see the city or country in which the end-user resides, the initial map view could be of that locale. If the user then accepts sharing their fine-grained details, the more detailed locational information could be sent to the extension through the callback, and the map view updated. If the end-user decides not to accept the request, then the callback request may never be sent. Callback delays were not implemented in the proof-of-concept for the Virtual Faraday Cage, and there are no set guidelines for how the delays should ideally be implemented. Future work in this area could explore callback delays and their impact on privacy, as well as how “ideal” delays should be implemented. 5.4.4 Inter-extension Communication As indicated in Section 5.3.2, inter-extension communication remains a challenge in the Virtual Faraday Cage. Not only is communication of end-user data between different end-users’ instances of the same extension difficult, but there is no proposal for how different extensions may communicate, if at all. One way to address this problem might be to split the local extension component’s cache space from the channel for inbound communication from third-parties. Instead of a unified local extension cache space, there would be a remote cache and a local cache. The remote cache would allow the third-party to write directly to it; however, no other entities would be able to write to data in that location. The local cache would grant full read and write capabilities to the local extension component, but no other entities would be able to read or write to the local cache. In this scenario, the web application platform would then decide which other end-users’ extensions can read from the remote cache. This may be done automatically, or by prompting the end-user to choose other end-users specifically. On a social network, this may be dictated by friendship relations between end-users. As the remote cache is read-only for local extension components, it prohibits the leakage of one end-user’s private data to another end-user’s caches; the only data that can be shared with other users of that extension is data that the third-party already had, and upstream communication is still prohibited. Future work should explore this area, as one of the big benefits of social-network-based extensions is that they can leverage an end-user’s social network. However, without extensions running on the same platform being able to communicate effectively and in a privacy-preserving manner, this advantage is lost. 5.4.5 Time & Space Complexity and Benchmarking As stated in Sections 5.2.6, 5.2.7, and 5.3.6, the time and space complexity analysis for sandboxing and the protocol is currently lacking. First, while limits on computational and storage resources for local extension components should be decided on a per-platform basis, guidelines should be proposed, along with the rationale behind them. Ideally, limits should be expandable on a per-extension and/or per-user basis – this could be accomplished through manual means (e.g., the platform reviews and decides), or through charging end-users and/or third-parties. Additionally, an in-depth study is needed of how local extension components can be structured to both provide maximum features (e.g., supporting long-lived tasks) and minimize resource costs.
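As one example of the kind of guideline such a study might produce, the following sketch shows how per-extension memory, CPU, and wall-clock limits could be enforced, assuming each local extension component runs as its own worker process on a POSIX host (an assumption of this illustration, not a description of the proof-of-concept):

import resource
import subprocess

def run_local_component(cmd, mem_bytes=64 * 1024 * 1024, cpu_seconds=2, wall_seconds=5):
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))       # cap memory
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))  # cap CPU time
    proc = subprocess.Popen(cmd, preexec_fn=apply_limits)
    try:
        return proc.wait(timeout=wall_seconds)  # terminate runaway components by wall clock
    except subprocess.TimeoutExpired:
        proc.kill()
        return -1

# e.g., run_local_component(["python3", "extension_component.py"])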
As indicated in Section 5.2.6, one potential avenue to addressing computational and storage resource usage would be to pass the costs to either third-parties or end-users, or both. A future study should ascertain how such a system would work, and whether it would be economically viable and attractive to all involved parties. Another option may be to consider offloading storage and computational resources to end-users themselves, e.g., by requiring that third-party local extension components be either written in JavaScript or converted to it – and then running them within a sandbox inside the end-users’ web browser. Finally, benchmarks should also be established – both for sandboxing as well as for the messages passed within the protocol. Without concrete data regarding both of these aspects of the Virtual Faraday Cage, it would remain unclear how scalable the Virtual Faraday Cage is in practice. 5.4.6 URI Ontology Structuring data across web application platforms in a uniform way may also be highly advantageous; however, it is not clear how one might accomplish this. With a unified URI ontology for data across these different web platforms, it may be possible to write extensions that can operate across these platforms, because the underlying end-user data is still structured in the same way and located at the same URIs. Future work exploring this area may overlap with work on the Semantic Web [140], as the latter is an attempt to standardize the categorization and structure of diverse data content. 5.5 Summary This thesis has presented a novel architecture for web application platforms that seek to facilitate the interaction between end-users and third-party extensions in a privacy-aware manner. Not only can information be shared in a granular way, but information can also be withheld completely from third-parties while still being made usable by third-party extensions. The Virtual Faraday Cage advocates a paradigm shift from the idea that privacy must be sacrificed for utility, to the idea that privacy can be preserved while still gaining some utility. In the process, this thesis has presented an overview of privacy in the area of web applications, and highlighted many of the challenges associated with existing approaches. This thesis has also presented a theoretical model, within which security and privacy issues were examined. By using this model and by splitting end-user data into two disjoint sets (private data and sensitive data) while enforcing a strict information flow control policy, private data can be unconditionally protected from disclosure to third-parties. On the other hand, it was also shown that any sensitive data that is ever disclosed to any third-party cannot be protected from disclosure to unauthorized parties. Following that, the Virtual Faraday Cage was presented and described in detail. The Virtual Faraday Cage architecture abides by the theoretical model, and consequently is capable of providing the same unconditional guarantee for the protection of end-user private data. The Virtual Faraday Cage is applicable to a broad category of web applications, and is capable of supporting both “current-style” third-party extensions as well as new hybrid extensions and locally-hosted (sandboxed) extensions. Finally, a proof-of-concept for the Virtual Faraday Cage was also constructed, demonstrating that the Virtual Faraday Cage’s core functionalities are feasible.
In conclusion, privacy continues to be a significant challenge for web applications, and the Virtual Faraday Cage is just one step towards addressing some of these problems.
Bibliography
[1] K. Barker, M. Askari, M. Banerjee, K. Ghazinour, B. Mackas, M. Majedi, S. Pun, and A. Williams, “A data privacy taxonomy,” in Proceedings of the 26th British National Conference on Databases: Dataspace: The Final Frontier, ser. BNCOD 26. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 42–54. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-02843-4_7
[2] Department of Computer Science. (2012) The advanced database systems and application (ADSA) laboratory. University of Calgary. [Online]. Available: http://www.adsa.cpsc.ucalgary.ca/
[3] MySpace Developer Platform. (2010) Applications FAQs. Retrieved 4/20/2012; Last modified (according to site) on 9/9/2010. [Online]. Available: http://wiki.developer.myspace.com/index.php?title=Applications_FAQs#I_just_published_my_app._How_long_will_the_approval_process_take.3F
[4] Facebook. (2012) Facebook developers. [Online]. Available: http://developers.facebook.com/
[5] D. M. Boyd and N. B. Ellison, “Social network sites: Definition, history, and scholarship,” Journal of Computer-Mediated Communication, vol. 13, no. 1, pp. 210–230, 2007. [Online]. Available: http://ezproxy.lib.ucalgary.ca:2048/login?url=http://search.ebscohost.com/login.aspx?direct=true&db=ufh&AN=27940595&site=ehost-live
[6] Unknown. (2007) 10 facts you should know about bebo. Sociable Blog. [Online]. Available: http://www.sociableblog.com/2007/11/14/10-facts-you-should-know-about-bebo/
[7] J. Owyang. (2008) Social network stats: Facebook, myspace, reunion. Web Strategy by Jeremiah Owyang. [Online]. Available: http://www.web-strategist.com/blog/2008/01/09/social-network-stats-facebook-myspace-reunion-jan-2008/
[8] J. Smith. (2009, February) Facebook surpasses 175 million users, continuing to grow by 600k users/day. Inside Facebook. [Online]. Available: http://www.insidefacebook.com/2009/02/14/facebook-surpasses-175-million-users-continuing-to-grow-by-600k-usersday/
[9] Friendster Inc. (2009) About friendster. Friendster, Inc. Retrieved on December 22nd, 2009. [Online]. Available: http://www.friendster.com/info/index.php
[10] P. Perez. (2009) Orkut — stimulate your social life. Social Networks 10. Retrieved on December 22nd, 2009. [Online]. Available: http://www.socialnetworks10.com/orkut
[11] LiveJournal, Inc. (2009) About LiveJournal. LiveJournal, Inc. Retrieved on December 22nd, 2009. [Online]. Available: http://www.livejournal.com/
[12] Tencent Inc. (2009) What is qq? I’M QQ - QQ Official Site. Retrieved December 22nd 2009. [Online]. Available: http://www.imqq.com/
[13] W. Luo, Q. Xie, and U. Hengartner, “FaceCloak: An architecture for user privacy on social networking sites,” Computational Science and Engineering, IEEE International Conference on, vol. 3, pp. 26–33, 2009.
[14] United Nations. (1948) The universal declaration of human rights. Retrieved on 4/20/2012. [Online]. Available: http://www.un.org/en/documents/udhr/
[15] United Nations Educational Scientific and Cultural Organization. (2012) UNESCO privacy chair - motivation. Retrieved 4/20/2012. [Online]. Available: http://unescoprivacychair.urv.cat/motivacio.php
[16] P. Guarda and N. Zannone, “Towards the development of privacy-aware systems,” Information & Software Technology, vol. 51, pp. 337–350, 2009. [Online].
Available: http://academic.research.microsoft.com/Publication/5881826/towards-the-development-of-privacy-aware-systems
[17] S. Kenny and J. J. Borking, “The value of privacy engineering,” Journal of Information, Law and Technology, vol. 2002, 2002. [Online]. Available: http://academic.research.microsoft.com/Publication/794202/the-value-of-privacy-engineering
[18] N. Kiyavitskaya, A. Krausova, and N. Zannone, “Why eliciting and managing legal requirements is hard,” in Requirements Engineering and Law, 2008. RELAW ’08., Sept. 2008, pp. 26–30.
[19] R. Gellman, “Privacy in the clouds: Risks to privacy and confidentiality from cloud computing,” World Privacy Forum, Tech. Rep., 2009.
[20] P. Roberts. (2011) HIPAA bares its teeth: $4.3m fine for privacy violation. threatpost: The Kaspersky Lab Security News Service. Retrieved 4/20/2012. [Online]. Available: http://threatpost.com/en_us/blogs/hipaa-bares-its-teeth-43m-fine-privacy-violation-022311
[21] Government of Canada, “Personal information protection and electronic documents act,” 2000.
[22] E. Denham (Assistant Privacy Commissioner of Canada), “Report of findings into the complaint filed by the canadian internet policy and public interest clinic (cippic) against facebook inc. under the personal information protection and electronic documents act,” Findings under the Personal Information Protection and Electronic Documents Act (PIPEDA), 2009.
[23] P. Lawson, “Pipeda complaint: Facebook,” Canadian Internet Policy and Public Interest Clinic, 2008.
[24] T. Israel, “Statement of concern re: Facebook’s new privacy approach,” Canadian Internet Policy and Public Interest Clinic (CIPPIC), 2010.
[25] European Parliament and Council, “Directive 95/46/ec of the european parliament and of the council on the protection of individuals with regard to the processing of personal data and on the free movement of such data,” Official Journal of the European Communities, 1995.
[26] ——, “Directive 2002/58/ec of the european parliament and of the council concerning the processing of personal data and the protection of privacy in the electronic communications sector (directive on privacy and electronic communications),” Official Journal of the European Communities, 2002.
[27] E. Commission, “Regulation of the european parliament and of the council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (general data protection regulation),” European Commission, 2012.
[28] United Kingdom. (1998) Data protection act 1998. [Online]. Available: http://www.legislation.gov.uk/ukpga/1998/29/contents
[29] U.S. Department of Justice. (1988) Overview of the privacy act of 1974, 2012 edition. [Online]. Available: http://www.justice.gov/opcl/1974privacyact-overview.htm
[30] Office of Justice Programs, Justice Information Sharing, U.S. Department of Justice. (1986) Electronic communications privacy act of 1986. [Online]. Available: https://it.ojp.gov/default.aspx?area=privacy&page=1285
[31] R. Gross and A. Acquisti, Privacy Enhancing Technologies. Springer Berlin / Heidelberg, 2006, ch. Imagined Communities: Awareness, Information Sharing, and Privacy on the Facebook, pp. 36–58. [Online]. Available: http://www.springerlink.com/content/gx00n8nh88252822
[32] Facebook. (2012) Facebook ads. [Online]. Available: http://www.facebook.com/advertising/
[33] J. Quittner. (2008) MySpace to businesses: Kiss my ads. Retrieved on 4/20/2012. [Online]. Available: http://www.time.com/time/business/article/0,8599,1849458,00.html
[34] P. Domingos and M.
Richardson, “Mining the network value of customers,” in KDD ’01: Proceedings of the seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2001, pp. 57–66. [Online]. Available: http://dx.doi.org/10.1145/502512.502525
[35] S. Staab, P. Domingos, P. Mika, J. Golbeck, L. Ding, T. Finin, A. Joshi, A. Nowak, and R. R. Vallacher, “Social networks applied,” Intelligent Systems, IEEE [see also IEEE Intelligent Systems and Their Applications], vol. 20, no. 1, pp. 80–93, 2005. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1392679
[36] N. Mook. (2005) Cross-site scripting worm hits MySpace. betanews. Retrieved 4/20/2012. [Online]. Available: http://betanews.com/2005/10/13/cross-site-scripting-worm-hits-myspace/
[37] K. J. Higgins. (2012) Worm siphons 45,000 facebook accounts. Dark Reading (UBM TechWeb). Retrieved 4/21/2012. [Online]. Available: http://www.darkreading.com/insider-threat/167801100/security/attacks-breaches/232301379/worm-siphons-45-000-facebook-accounts.html?itc=edit_stub
[38] S. Machlis. (2010) How to hijack facebook using firesheep. Computerworld (PCWorld). Retrieved on 4/21/2012. [Online]. Available: http://www.pcworld.com/article/209333/how_to_hijack_facebook_using_firesheep.html
[39] J. Gold. (2012) Social predators still gaming the system on facebook. NetworkWorld (PCWorld). Retrieved 4/21/2012. [Online]. Available: http://www.pcworld.com/article/254185/social_predators_still_gaming_the_system_on_facebook.html
[40] D. Danchev. (2012) Facebook phishing attack targets syrian activists. ZDNet (CBS Interactive). Retrieved 4/21/2012. [Online]. Available: http://www.zdnet.com/blog/security/facebook-phishing-attack-targets-syrian-activists/11217
[41] S. Mukhopadhyay. (2009) Black woman murdered by stalker who had been using Youtube and Facebook to threaten her. Feministing. Retrieved 4/20/2012. [Online]. Available: http://feministing.com/2009/04/14/black_woman_murdered_by_facebo/
[42] A. Moya. (2009) An online tragedy. CBS News. Retrieved 4/20/2012. [Online]. Available: http://www.cbsnews.com/stories/2000/03/23/48hours/main175556.shtml
[43] Metro.co.uk. (2009) Facebook stalker is given a life ban for bombarding student with 30 messages a day. Associated Newspapers Limited. Retrieved 4/20/2012. [Online]. Available: http://www.metro.co.uk/news/765760-facebook-stalker-is-given-a-life-ban-for-bombarding-student-with-30-messages-a-day
[44] D. Thompson. (2011) Facebook stalker won’t ’fit in’ to prison, says lawyer. Associated Press (via MSNBC). Retrieved 4/20/2012. [Online]. Available: http://www.msnbc.msn.com/id/42019388/ns/technology_and_science-security/
[45] The Telegraph. (2012) Facebook stalker ’murdered’ ex-girlfriend. Telegraph Media Group Limited. Retrieved 4/20/2012. [Online]. Available: http://www.telegraph.co.uk/news/uknews/crime/9041971/Facebook-stalker-murdered-ex-girlfriend.html
[46] T. N. Jagatic, N. A. Johnson, M. Jakobsson, and F. Menczer, “Social phishing,” Communications of the ACM, vol. 50, no. 10, pp. 94–100, 2007.
[47] Microsoft Corporation. (2013) Phishing: Frequently asked questions. [Online]. Available: http://www.microsoft.com/security/online-privacy/phishing-faq.aspx
[48] Discovery Channel. (2007) 2057. Discovery Communications LLC. Retrieved on 4/20/2012. [Online]. Available: http://dsc.discovery.com/convergence/2057/2057.html
[49] CBC. (2009) Depressed woman loses benefits over Facebook photos. CBC News (Montreal). [Online].
Available: http://www.cbc.ca/canada/montreal/story/2009/11/19/quebec-facebook-sick-leave-benefits.html
[50] J. M. Grohol. (2011) Posting about health concerns on facebook, twitter. Psych Central. [Online]. Available: http://psychcentral.com/blog/archives/2011/02/01/posting-about-health-concerns-on-facebook-twitter/
[51] S. Li. (2011) Insurers are scouring social media for evidence of fraud. Los Angeles Times. [Online]. Available: http://articles.latimes.com/2011/jan/25/business/la-fi-facebook-evidence-20110125
[52] Agence France-Presse. (2010) Police charge driver over Facebook speeding clip: report. ABS-CBN Interactive. [Online]. Available: http://www.abs-cbnnews.com/lifestyle/classified-odd/10/13/10/police-charge-driver-over-facebook-speeding-clip-report
[53] G. Laasby. (2011) Greenfield teen’s facebook post for drugs leads to arrest. Journal Sentinel Online (Milwaukee Wisconsin). Retrieved 4/20/2012. [Online]. Available: http://www.jsonline.com/news/crime/greenfield-teens-facebook-post-for-drugs-leads-to-arrest-131826718.html
[54] WRAL Raleigh-Durham Fayetteville, N.C. (2005) NCSU students face underage drinking charges due to online photos. Capitol Broadcasting Company Inc. (Mirrored by the Internet Archive Wayback Machine). Retrieved 4/20/2012. [Online]. Available: http://web.archive.org/web/20051031084848/http://www.wral.com/news/5204275/detail.html
[55] D. Chalfant. (2005) Facebook postings, photos incriminate dorm party-goers. The Northerner Online (Mirrored by the Internet Archive’s Wayback Machine). Retrieved 4/20/2012. [Online]. Available: http://web.archive.org/web/20051125003232/http://www.thenortherner.com/media/paper527/news/2005/11/02/News/Facebook.Postings.Photos.Incriminate.Dorm.PartyGoers-1042037.shtml
[56] TMCnews. (2006) Officials at institutions nationwide using facebook site. Technology Marketing Corporation. Retrieved 4/20/2012. [Online]. Available: http://www.tmcnet.com/usubmit/2006/03/29/1518943.htm
[57] N. Hass. (2006) In your facebook.com. The New York Times. Retrieved 4/20/2012. [Online]. Available: http://www.nytimes.com/2006/01/08/education/edlife/facebooks.html
[58] R. Gross, A. Acquisti, and H. J. Heinz, III, “Information revelation and privacy in online social networks,” in WPES ’05: Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society. New York, NY, USA: ACM, 2005, pp. 71–80.
[59] T. Krazit. (2010) What google needs to learn from buzz backlash. CNN Tech (Mirrored by the Internet Archive’s Wayback Machine). Retrieved 4/20/2012. [Online]. Available: http://web.archive.org/web/20100220075121/http://www.cnn.com/2010/TECH/02/18/cnet.google.buzz/index.html
[60] H. Tsukayama. (2012) Google faces backlash over privacy changes. The Washington Post. Retrieved 4/20/2012. [Online]. Available: http://www.washingtonpost.com/business/technology/google-faces-backlash-over-privacy-changes/2012/01/25/gIQAVQnMQQ_story.html
[61] H. Blodget. (2007) Facebook’s “beacon” infuriates users, moveon. Business Insider. [Online]. Available: http://articles.businessinsider.com/2007-11-21/tech/30022354_1_facebook-users-web-sites-ads
[62] R. Singel. (2009) Facebook loosens privacy controls, sparks a backlash. WIRED. [Online]. Available: http://www.wired.com/epicenter/2009/12/facebook-privacy-backlash/
[63] BBC. (2010) Facebook mulls u-turn on privacy. BBC News. [Online]. Available: http://news.bbc.co.uk/2/hi/technology/10125260.stm
[64] D.
Rosenblum, “What anyone can know: The privacy risks of social networking sites,” IEEE Security and Privacy, vol. 5, pp. 40–49, 2007.
[65] A. Felt and D. Evans, “Privacy protection for social networking platforms,” Web 2.0 Security and Privacy 2008 (Workshop). 2008 IEEE Symposium on Security and Privacy, 2008.
[66] J. Bonneau, J. Anderson, and G. Danezis, “Prying data out of a social network,” Social Network Analysis and Mining, International Conference on Advances in, vol. 0, pp. 249–254, 2009.
[67] Facebook. (2012) Facebook platform policies. Retrieved 4/21/2012. [Online]. Available: http://developers.facebook.com/policy/
[68] S. Kelly. (2008) Identity ’at risk’ on facebook. BBC (BBC Click). [Online]. Available: http://news.bbc.co.uk/2/hi/programmes/click_online/7375772.stm
[69] A. Felt, P. Hooimeijer, D. Evans, and W. Weimer, “Talking to strangers without taking their candy: isolating proxied content,” in SocialNets ’08: Proceedings of the 1st Workshop on Social Network Systems. New York, NY, USA: ACM, 2008, pp. 25–30.
[70] S. Buchegger, “Ubiquitous social networks,” in Ubicomp Grand Challenge, Ubiquitous Computing at a Crossroads Workshop, London, U.K., January 6-7, 2009.
[71] N. Jag. (2007) MySpace friend adder bots exposed! [Online]. Available: http://www.nickjag.com/2007/08/23/myspace-friend-adder-bots-exposed/
[72] L. Conway. (2008, November) Virgin atlantic sacks 13 staff for calling its flyers ’chavs’. The Independent (UK). [Online]. Available: http://www.independent.co.uk/news/uk/home-news/virgin-atlantic-sacks-13-staff-for-calling-its-flyers-chavs-982192.html
[73] WKMG. (2007) Teacher fired over MySpace page. ClickOrlando.com (WKMG Local 6) - Mirrored by the Internet Archive’s Wayback Machine. [Online]. Available: http://web.archive.org/web/20090217150126/http://clickorlando.com/education/10838194/detail.html
[74] WESH. (2006) Local sheriff’s deputy fired over myspace profile. MSNBC (WESH Orlando). [Online]. Available: http://www.wesh.com/news/9400560/detail.html
[75] A. Judd. (2009, November) Ashley payne, former teacher fired for Facebook pictures. NowPublic - Crowd Powered Media. [Online]. Available: http://www.nowpublic.com/strange/ashley-payne-former-teacher-fired-facebook-pictures-2515440.html
[76] L. Wu, M. Majedi, K. Ghazinour, and K. Barker, “Analysis of social networking privacy policies,” in International Conference on Extending Database Technology, 2010, pp. 1–5. [Online]. Available: http://academic.research.microsoft.com/Publication/13267317/analysis-of-social-networking-privacy-policies
[77] Proofpoint, Inc., “Outbound email and data loss prevention in today’s enterprise,” Proofpoint, Inc., Tech. Rep., 2009.
[78] Reputation.com. (2012) Online reputation management. Retrieved 4/21/2012. [Online]. Available: http://www.reputation.com/
[79] Zululex LLC. (2012) ZuluLex reputation management company. Retrieved 4/21/2012. [Online]. Available: http://reputation.zululex.com/
[80] D. E. Denning, “A lattice model of secure information flow,” Communications of the ACM, vol. 19, no. 5, pp. 236–243, May 1976. [Online]. Available: http://doi.acm.org/10.1145/360051.360056
[81] J. H. Saltzer, “Protection and the control of information sharing in multics,” Communications of the ACM, vol. 17, no. 7, pp. 388–402, Jul. 1974. [Online]. Available: http://doi.acm.org/10.1145/361011.361067
[82] D. E. Bell and L. J. LaPadula, “Secure computer systems: A mathematical model. volume ii.” MITRE Corporation Bedford, MA, 1998. [Online]. Available: http://oai.
[83] A. C. Myers and B. Liskov, “A decentralized model for information flow control,” in Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, ser. SOSP ’97. New York, NY, USA: ACM, 1997, pp. 129–142. [Online]. Available: http://doi.acm.org/10.1145/268998.266669
[84] I. Papagiannis, M. Migliavacca, P. Pietzuch, B. Shand, D. Eyers, and J. Bacon, “PrivateFlow: decentralised information flow control in event-based middleware,” in Proceedings of the Third ACM International Conference on Distributed Event-Based Systems, ser. DEBS ’09. New York, NY, USA: ACM, 2009, pp. 38:1–38:2. [Online]. Available: http://doi.acm.org/10.1145/1619258.1619306
[85] A. Futoransky and A. Waissbein, “Enforcing privacy in web applications,” in Conference on Privacy, Security and Trust, 2005. [Online]. Available: http://academic.research.microsoft.com/Publication/1906067/enforcing-privacy-in-web-applications
[86] Apache Software Foundation. (2011) Apache Accumulo. Retrieved 4/21/2012. [Online]. Available: http://accumulo.apache.org/
[87] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham, “Efficient software-based fault isolation,” SIGOPS Operating Systems Review, vol. 27, no. 5, pp. 203–216, Dec. 1993. [Online]. Available: http://doi.acm.org/10.1145/173668.168635
[88] I. Goldberg, D. Wagner, R. Thomas, and E. A. Brewer, “A secure environment for untrusted helper applications: confining the wily hacker,” in Proceedings of the 6th Conference on USENIX Security Symposium, Focusing on Applications of Cryptography - Volume 6, ser. SSYM’96. Berkeley, CA, USA: USENIX Association, 1996, pp. 1–1. [Online]. Available: http://dl.acm.org/citation.cfm?id=1267569.1267570
[89] S. Maffeis and A. Taly, “Language-based isolation of untrusted JavaScript,” in Computer Security Foundations Workshop, 2009, pp. 77–91. [Online]. Available: http://academic.research.microsoft.com/Paper/5113449.aspx
[90] Google et al. (2007) google-caja: Compiler for making third-party HTML, CSS and JavaScript safe for embedding. Retrieved 4/21/2012. [Online]. Available: http://code.google.com/p/google-caja/
[91] W3C. (1998) Platform for Privacy Preferences (P3P) project. World Wide Web Consortium (W3C). Retrieved 4/21/2012. [Online]. Available: http://www.w3.org/P3P/
[92] Microsoft Corporation. (2012) A history of Internet Explorer. Retrieved 4/21/2012. [Online]. Available: http://windows.microsoft.com/en-us/internet-explorer/products/history
[93] G. Karjoth, M. Schunter, and E. V. Herreweghen, “Translating privacy practices into privacy promises - how to promise what you can keep,” in Policies for Distributed Systems and Networks, 2003, pp. 135–146. [Online]. Available: http://academic.research.microsoft.com/Publication/664496/translating-privacy-practices-into-privacy-promises-how-to-promise-what-you-can-keep
[94] P. Ashley, S. Hada, G. Karjoth, and M. Schunter, “E-P3P privacy policies and privacy authorization,” in Proceedings of the 2002 ACM Workshop on Privacy in the Electronic Society, ser. WPES ’02. New York, NY, USA: ACM, 2002, pp. 103–109. [Online]. Available: http://doi.acm.org/10.1145/644527.644538
[95] K. Ghazinour and K. Barker, “Capturing P3P semantics using an enforceable lattice-based structure,” pp. 1–6, 2011. [Online]. Available: http://academic.research.microsoft.com/Publication/39237918/capturing-p3p-semantics-using-an-enforceable-lattice-based-structure
[96] A. Rezgui, M. Ouzzani, A. Bouguettaya, and B. Medjahed, “Preserving privacy in web services,” in Proceedings of the 4th International Workshop on Web Information and Data Management, ser. WIDM ’02. New York, NY, USA: ACM, 2002, pp. 56–62. [Online]. Available: http://doi.acm.org/10.1145/584931.584944
[97] M. A. C. Dekker, S. Etalle, and J. den Hartog, “Privacy in an ambient world,” Enschede, April 2006, imported from CTIT. [Online]. Available: http://doc.utwente.nl/66174/
[98] M. W. Bagga, “Privacy-enabled application scenarios for web services,” 2003. [Online]. Available: http://academic.research.microsoft.com/Publication/4655282/privacy-enabled-application-scenarios-for-web-services
[99] W. Xu, V. N. Venkatakrishnan, R. Sekar, and I. V. Ramakrishnan, “A framework for building privacy-conscious composite web services,” in International Conference on Web Services, 2006, pp. 655–662. [Online]. Available: http://academic.research.microsoft.com/Publication/2362859/a-framework-for-building-privacy-conscious-composite-web-services
[100] K. El-Khatib, “A privacy negotiation protocol for web services,” in Workshop on Collaboration Agents: Autonomous Agents for Collaborative Environments. Halifax, Nova Scotia, Canada: NRC Institute for Information Technology; National Research Council Canada, 2003. [Online]. Available: http://academic.research.microsoft.com/Publication/4507596/a-privacy-negotiation-protocol-for-web-services
[101] S. Benbernou, H. Meziane, Y. H. Li, and M.-S. Hacid, “A privacy agreement model for web services,” in Services Computing, 2007. SCC 2007. IEEE International Conference on, July 2007, pp. 196–203.
[102] D. Métayer, “A formal privacy management framework,” in Formal Aspects in Security and Trust, P. Degano, J. Guttman, and F. Martinelli, Eds. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 162–176. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-01465-9_11
[103] B. Luo and D. Lee, “On protecting private information in social networks: A proposal,” in Data Engineering, 2009. ICDE ’09. IEEE 25th International Conference on, March 29–April 2, 2009, pp. 1603–1606.
[104] R. Hamadi, H. Young Paik, and B. Benatallah, “Conceptual modeling of privacy-aware web service protocols,” in Conference on Advanced Information Systems Engineering, 2007, pp. 233–248. [Online]. Available: http://academic.research.microsoft.com/Publication/2419338/conceptual-modeling-of-privacy-aware-web-service-protocols
[105] S. E. Levy and C. Gutwin, “Improving understanding of website privacy policies with fine-grained policy anchors,” in World Wide Web Conference Series, 2005, pp. 480–488. [Online]. Available: http://academic.research.microsoft.com/Publication/1242366/improving-understanding-of-website-privacy-policies-with-fine-grained-policy-anchors
[106] Y. Tian, B. Song, and E.-Nam Huh, “A threat-based privacy preservation system in untrusted environment,” in International Conference on Hybrid Information Technology, 2009, pp. 102–107. [Online]. Available: http://academic.research.microsoft.com/Publication/6054821/a-threat-based-privacy-preservation-system-in-untrusted-environment
[107] S. Guha, K. Tang, and P. Francis, “NOYB: privacy in online social networks,” in WOSP ’08: Proceedings of the First Workshop on Online Social Networks. New York, NY, USA: ACM, 2008, pp. 49–54.
[108] M. M. Lucas and N. Borisov, “FlyByNight: mitigating the privacy risks of social networking,” in WPES ’08: Proceedings of the 7th ACM Workshop on Privacy in the Electronic Society. New York, NY, USA: ACM, 2008, pp. 1–8.
[109] R. Baden, A. Bender, N. Spring, B. Bhattacharjee, and D. Starin, “Persona: an online social network with user-defined privacy,” SIGCOMM Computer Communication Review, vol. 39, no. 4, pp. 135–146, 2009.
[110] World Wide Web Consortium (W3C). (2013) Cross-origin resource sharing. [Online]. Available: http://www.w3.org/TR/cors/
[111] A. Barth, A. P. Felt, P. Saxena, and A. Boodman, “Protecting browsers from extension vulnerabilities,” in Network and Distributed System Security Symposium, 2009. [Online]. Available: http://academic.research.microsoft.com/Publication/6357389/protecting-browsers-from-extension-vulnerabilities
[112] A. P. Felt, K. Greenwood, and D. Wagner, “The effectiveness of install-time permission systems for third-party applications,” UC Berkeley, Tech. Rep. EECS-2010-143, 2010.
[113] M. Fredrikson and B. Livshits, “REPRIV: Re-envisioning in-browser privacy,” Microsoft Research, Tech. Rep. MSR-TR-2010-116, 2010. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.173.13
[114] L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 571–588, Oct. 2002. [Online]. Available: http://dx.doi.org/10.1142/S021848850200165X
[115] Facebook. (2012) User - Facebook developers (Graph API). [Online]. Available: https://developers.facebook.com/docs/reference/api/user/
[116] ——. (2012) Graph API. [Online]. Available: https://developers.facebook.com/docs/reference/api/#insights
[117] J. Anderson, “Computer security technology planning study,” Command and Management Systems, HQ Electronic Systems Division (AFSC), Tech. Rep., 1972.
[118] V. Stinner. (2012) pysandbox 1.5: Python Package Index. [Online]. Available: http://pypi.python.org/pypi/pysandbox
[119] M. Muhammad and J. Cappos. (2012) Seattle: Open peer-to-peer computing. Computer Science & Engineering at the University of Washington. [Online]. Available: https://seattle.cs.washington.edu/html/
[120] World Wide Web Consortium (W3C). (2000) SOAP (Simple Object Access Protocol). [Online]. Available: http://www.w3.org/TR/soap/
[121] R. T. Fielding, “Representational state transfer (REST), from Architectural Styles and the Design of Network-Based Software Architectures,” Ph.D. dissertation, University of California, Irvine, 2000. Retrieved on 4/23/2012. [Online]. Available: http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
[122] Google. (2011) Introducing ProtoRPC for writing App Engine web services in Python. [Online]. Available: http://googleappengine.blogspot.ca/2011/04/introducing-protorpc-for-writing-app.html
[123] ——. (2012) Google App Engine. [Online]. Available: https://developers.google.com/appengine/
[124] ——. (2012) Google App Engine: Pure Python. [Online]. Available: https://developers.google.com/appengine/docs/python/runtime#Pure_Python
[125] J. Simon-Gaarde. (2011) Ladon webservice. [Online]. Available: http://packages.python.org/ladon/
[126] Arskom Bilgisayar Danışmanlık ve Tic. Ltd. Şti. (2011) Rpclib. [Online]. Available: https://github.com/arskom/rpclib
[127] Unknown. (2011) pysimplesoap: Python simple SOAP library. [Online]. Available: http://code.google.com/p/pysimplesoap/
[128] J. Ortel, J. Noehr, and N. V. Gheem. (2010) Suds. [Online]. Available: https://fedorahosted.org/suds/
[129] Boomi Inc. (2008) appengine-rest-server: REST server for Google App Engine applications. [Online]. Available: http://code.google.com/p/appengine-rest-server/
[130] A. Swartz et al. (2012) Welcome to web.py. [Online]. Available: http://webpy.org/
[131] A. Ronacher et al. (2012) Flask - web development, one drop at a time. Pocoo. [Online]. Available: http://flask.pocoo.org/
[132] M. Hellkamp et al. (2012) Bottle: Python web framework. [Online]. Available: http://bottlepy.org/docs/dev/
[133] Yahoo! Inc. (2012) Make Yahoo! web service REST calls with Python. Retrieved on 4/24/2012. [Online]. Available: http://developer.yahoo.com/python/python-rest.html
[134] Genivia Inc. (2012) The gSOAP toolkit for SOAP web services and XML-based applications. [Online]. Available: http://www.cs.fsu.edu/~engelen/soap.html
[135] Apache Software Foundation. (2012) Web services project @ Apache. [Online]. Available: http://ws.apache.org/soap/
[136] J. King and J. Kawash, “A real-time XML protocol for bridging virtual communities,” International Journal of Networking and Virtual Organisations, vol. 9, no. 3, pp. 248–264, September 2011. [Online]. Available: http://dx.doi.org/10.1504/IJNVO.2011.042482
[137] World Wide Web Consortium (W3C). (2013) Web storage. [Online]. Available: http://dev.w3.org/html5/webstorage/#disk-space
[138] G. Keizer. (2009) Microsoft cites 'click fatigue' for Windows 7 security change. Computerworld. [Online]. Available: http://www.computerworld.com/s/article/9127458/Microsoft_cites_click_fatigue_for_Windows_7_security_change?taxonomyId=17&pageNumber=2
[139] World Wide Web Consortium (W3C). (2009) Efficient XML Interchange evaluation. [Online]. Available: http://www.w3.org/TR/2009/WD-exi-evaluation-20090407/
[140] World Wide Web Consortium (W3C). (2001) W3C Semantic Web Activity. [Online]. Available: http://www.w3.org/2001/sw/