Exploring and Improving BitTorrent Swarm Topologies
Master's Thesis
Christian Decker ([email protected])

Distributed Computing Group
Computer Engineering and Networks Laboratory
ETH Zürich

Supervisors: Raphael Eidenbenz, Prof. Dr. Roger Wattenhofer

March 5, 2012

Abstract

BitTorrent is the most popular P2P protocol on the Internet, but it wastes bandwidth unnecessarily by disregarding the underlying network topology. Until now the behavior of BitTorrent swarms has either been studied in controlled environments or by analyzing part of the traffic they cause. In this thesis we propose a method to explore live swarms on the Internet and then use it to verify whether any locality improving strategies are in use. We then propose a lightweight method to suggest geographically close peers in order to improve the locality of active swarms.

Contents

Abstract
1 Introduction
  1.1 Overview and Related Work
  1.2 The BitTorrent Protocol
2 Exploring Swarm Topologies
  2.1 Method
  2.2 Challenges
    2.2.1 Scan complexity
    2.2.2 Error detection
    2.2.3 Invisible portion of the swarm
  2.3 Implementation
  2.4 Evaluation
3 Improving Swarm Topologies
  3.1 Geographical distance
  3.2 Are BitTorrent swarms locality-aware?
  3.3 Why PEX?
  3.4 Implementation with PEX
  3.5 Results
4 Conclusion
Bibliography
A Working with PlanetLab

Chapter 1: Introduction

BitTorrent has rapidly gained popularity over the last years. Of the 55% of global Internet traffic attributed to peer-to-peer (P2P) protocols, close to a third is due to BitTorrent [11], making it the most used P2P protocol. Its scalability, reduced download times and resilience compared to traditional client-server based systems make it the de facto standard for the distribution of large files. Its dominance in the P2P world makes it the perfect candidate for improvements, because even small changes can have a large impact.

One of the most discussed issues is that the ad-hoc networks sharing content are constructed without taking the underlying network topology into consideration, causing unnecessary bandwidth consumption [6]. Besides prompting attempts to improve network locality, the high bandwidth consumption has also led some Internet Service Providers (ISPs) to start filtering and throttling BitTorrent traffic [3].

During this thesis we were able to contribute:

1. A method for mapping BitTorrent swarm topologies without actively participating in the filesharing, allowing us to explore and map real swarms currently active on the Internet.

2. A verification, using our method, that there is no large scale use of locality improving strategies.

3. The proposal of geographical distance as an offline estimation method for the locality of peers.
4. The use of a large view of the swarm to improve the swarm topology by suggesting geographically close neighbors to the peers.

In the remaining part of Chapter 1 we give a brief overview of the BitTorrent protocol and point to related work. In Chapter 2 we introduce a method for exploring and mapping the topology of BitTorrent swarms and test it against live swarms to see how it performs. Finally, in Chapter 3 we propose a lightweight method for improving the swarm topology and use our topology mapping mechanism to verify its effectiveness.

1.1 Overview and Related Work

The first part of this thesis attempts to detect the topology of a swarm. Fauzie et al. [4] try to infer the topology by waiting for a peer to reveal its neighbors via a peer exchange (PEX) message. The view of the network gained with this method is not complete, since PEX messages only contain newly joined peers and peers do not announce when closing connections. During our initial experiments we also found that implementations disagree about the semantics of the PEX message: some clients forward PEX messages without having established a connection to the peers listed in them, thus producing false positives. While a PEX message from a peer proves that the peer knows about the advertised peers, it does not imply an active connection.

In the second part we use our measurement method to determine whether there is a tendency to create locality-aware swarms, and then implement a suggestion method to improve the locality of swarms. Locality improvement has been the goal of many studies, but as we will see in Section 3.2 none have been adopted at a large scale. Previous studies attempting to improve locality can generally be divided into three categories, according to which part of the system tries to improve the locality: the local client, by making an informed decision about which peers to contact; the tracker, by returning only peers that are close to the announcing peer; or the ISP, by interfering with outside communication to reduce peering costs.

For the client trying to choose better connections there have been proposals by Ren et al. [10], who use a combination of active and passive measurements of the underlying link, and by Choffnes et al. [2], who identify peers close to each other by using CDN domain name resolution as an oracle and grouping peers according to their DNS resolution behavior. In [9], Pereira et al. propose an Adaptive Search Radius based on TTL and active pinging of peers to grow the neighborhood until availability is guaranteed. Sirivianos et al. [12] use a large view exploit to free-ride. While they rely on trackers being tricked into providing a larger than normal view of the swarm, we use proxies on PlanetLab to achieve the same effect; instead of using the large view to free-ride, we use the gathered information to improve the locality of the swarm.

For the ISP attempting to keep client communication within its boundaries, Bindal et al. [1] propose a Biased Neighborhood Selection (BNS) method that intercepts tracker requests and responds with local peers. Another study by Tian et al. [13] proposes to place ISP-owned peers that act as a local cache of the content for nearby peers. Finally, Xie et al. propose P4P [14] to expose the network topology and communicate ISP policies to the P2P applications.
Most of the mentioned studies use an active measuring method in which locality is measured on the actual network, for example via the time-to-live field of the IP protocol. This means that the locality of a peer can only be established either by an active measurement (TCP ping, ICMP ping, ...) or passively once a connection has already been established. We argue that using the geographical distance as a first locality estimate allows a peer to make offline decisions about which peers to prefer.

1.2 The BitTorrent Protocol

BitTorrent is a P2P filesharing protocol. Unlike Gnutella or eMule, BitTorrent does not include mechanisms to search and discover content; instead it relies on metadata files called torrents being distributed out of band. The torrent file contains the details of the content being shared: an info hash which uniquely identifies the torrent, a tracker that is used to establish first contact, and SHA-1 hashes that allow integrity verification of the content.

Peers interested in sharing the content described in the torrent file form small ad-hoc networks called swarms. A tracker is used to bootstrap the communication. While not actively participating in the sharing of the content, a tracker collects the peers that participate in the swarm. To join a swarm, a peer contacts the tracker and announces its interest in joining. The tracker then takes a random subset of the peers matching the same info hash and returns them to the newly joining peer, which starts connecting to the peers it received in the tracker response. Clients usually keep a connection pool of around 50 connections active. The tracker may use a variety of transports, the most common being the traditional HTTP transport and a simplified UDP transport. Lately a distributed hash table (DHT) tracker based on the Kademlia protocol [8] has also been gaining popularity due to its distributed and fault tolerant nature.

Once connected to other peers in the swarm, a peer starts trading pieces it has for pieces it does not yet have. Peers trade with a fixed number of neighbors at once, called the active set (or upload slots in some clients). A connection added to the active set is said to be unchoked, while a connection removed from the active set is said to be choked; the peer will not serve piece requests over a choked connection. To bootstrap the trading, the original protocol includes optimistic unchokes: a peer randomly unchokes connections even though it has not traded with those neighbors in the past. Optimistic unchokes allow peers with no pieces to receive a few pieces and then be able to trade later, and they give the peer granting the optimistic unchoke a chance to find faster peers to trade with.

Peers in a swarm are divided into two groups:

• Seeds: peers that have all the pieces of the torrent and stay in the swarm in order to speed up the download of other peers.

• Leechers: peers that do not have all the pieces and are actively trading in order to complete the download.

There have been various extensions to the original protocol, including the Azureus Messaging Protocol (http://wiki.vuze.com/w/Azureus_messaging_protocol), the LibTorrent extension protocol (http://www.rasterbar.com/products/libtorrent/extension_protocol.html), protocol obfuscation and an alternative transport protocol based on UDP (uTP). The Azureus Messaging Protocol and the LibTorrent extension protocol implement a peer exchange mechanism used to inform neighbors of the existence of other peers in the swarm, allowing them to be contacted as well. uTP and protocol encryption are mainly used to improve congestion control and to avoid ISP throttling, respectively, and do not alter the behavior of the clients.
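The info hash mentioned above reappears in every announce and handshake throughout this thesis, so it is worth seeing how it is derived. The following minimal sketch (our own illustration, not the thesis implementation) computes it as the SHA-1 digest over the raw bencoded info dictionary of a torrent file; the file name is hypothetical:

    import hashlib

    def bdecode(data, i=0):
        """Decode one bencoded value starting at byte offset i.
        Returns (value, end_offset)."""
        c = data[i:i+1]
        if c == b"i":                          # integer: i<digits>e
            end = data.index(b"e", i)
            return int(data[i+1:end]), end + 1
        if c == b"l":                          # list: l<items>e
            i, items = i + 1, []
            while data[i:i+1] != b"e":
                value, i = bdecode(data, i)
                items.append(value)
            return items, i + 1
        if c == b"d":                          # dictionary: d<key><value>...e
            i, d = i + 1, {}
            while data[i:i+1] != b"e":
                key, i = bdecode(data, i)
                d[key], i = bdecode(data, i)
            return d, i + 1
        colon = data.index(b":", i)            # byte string: <length>:<bytes>
        length = int(data[i:colon])
        return data[colon+1:colon+1+length], colon + 1 + length

    def info_hash(torrent_bytes):
        """SHA-1 over the raw bencoded 'info' dictionary: the 20-byte
        value that identifies the swarm in announces and handshakes."""
        i = 1                                  # skip the outer dict's 'd'
        while torrent_bytes[i:i+1] != b"e":
            key, i = bdecode(torrent_bytes, i)
            start = i
            _, i = bdecode(torrent_bytes, i)   # skip over the value...
            if key == b"info":                 # ...but remember its raw span
                return hashlib.sha1(torrent_bytes[start:i]).digest()
        raise ValueError("torrent has no info dictionary")

    with open("example.torrent", "rb") as f:   # hypothetical file name
        print(info_hash(f.read()).hex())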
Chapter 2: Exploring Swarm Topologies

In this chapter we describe a method for analyzing the topology of live BitTorrent swarms under real world conditions. Using this method we are able to detect connections between peers in a swarm without participating in the filesharing and without any particular access to the software running at the peers.

2.1 Method

After establishing a connection, the BitTorrent protocol requires the exchange of handshake messages before any pieces are traded. A BitTorrent handshake message includes a protocol identification string, the info hash of the torrent for which the connection was opened, a reserved bitfield used to signal protocol extension support, and a 20 byte random peer id. The peer id serves to identify multiple connections between the same two peers and to avoid connecting to oneself. Although multiple connections are usually detected by keeping track of the connected IP addresses, checking the peer id is still required if clients use multiple addresses or if a client does not recognize its own IP address, for example because it is behind a network address translator (NAT). If the receiving peer already has another connection to a peer with the same peer id, we call this a peer id collision. In our experiments we observed that all tested clients drop a new connection if they detect a peer id collision.

The peer initiating the connection is required to send its handshake message first. If the receiving peer detects a collision it will not send its own handshake message and will close the connection immediately. The fact that the second half of the handshake is not sent on collision can be exploited to identify existing connections between peers: we can contact a peer and attempt to complete a handshake with each peer id we know, in order to find out whether the peer has an open connection to the peer corresponding to each peer id.

Knowing the peer ids and the addresses of the peers in the swarm, we create a two dimensional search space consisting of all possible peer × peer id tests. A test is a tuple of handshake parameters and a target peer with which to attempt the handshake. We call a complete handshake a negative result, because it implies that there is no connection from the peer to the peer id; an incomplete handshake is called a positive result.

[Figure 2.1: A simplified view of the method: the scanner (blue) opens a connection to peer B and attempts to complete a handshake pretending to be peer A. If the handshake succeeds there is no connection between peers A and B.]

To test whether the scanning works we created a small swarm on PlanetLab using Azureus and uTorrent and wrote a simple Python script to test each possible connection once. The sample scan in Figure 2.2 was performed by a single host. We then used the MaxMind IP Geolocation library (http://www.maxmind.com) to retrieve the geographical locations of the peers and overlaid the graph on top of a map using the matplotlib library (http://matplotlib.sourceforge.net/). By checking against the output of the netstat command we were able to verify that each connection detected by our script corresponds to a real connection. Notice that the resulting graph is nearly complete. This is because the connection pool size is larger than the number of peers in the swarm, and every peer tries to connect to as many peers as possible to guarantee the best selection for the trading of pieces. The only connections missing in Figure 2.2 are those between the seeds, as such connections are closed.
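To illustrate what a single test of such a script might look like, here is a minimal blocking-socket sketch (the thesis implementation is Twisted-based; the function name and timeouts here are our own assumptions):

    import socket

    HANDSHAKE_HEAD = b"\x13BitTorrent protocol"  # length-prefixed protocol string

    def probe(address, info_hash, peer_id, reserved=b"\x00" * 8, timeout=10):
        """Test whether the peer at `address` already has a connection to
        `peer_id`.  Returns False (negative) if the peer completed its side
        of the handshake, True (positive) if it aborted, and None if we
        could not connect at all.  Positives may be false positives (full
        connection pool, blocking, removed torrent) and must be retested."""
        try:
            sock = socket.create_connection(address, timeout=timeout)
        except OSError:
            return None
        try:
            sock.settimeout(timeout)
            # the initiator sends its handshake first, impersonating peer_id
            sock.sendall(HANDSHAKE_HEAD + reserved + info_hash + peer_id)
            reply = sock.recv(68)   # a full handshake is 68 bytes; a real
                                    # client would loop until all arrive
            return len(reply) == 0  # empty read: the peer closed on us
        except OSError:             # timeout or reset: a (possibly false) positive
            return True
        finally:
            sock.close()

In the real scanner the handshake is additionally delayed by a few milliseconds, so that connections dropped before our handshake arrives can be attributed to blocking rather than to a collision (see Section 2.2.2).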
Our method scans an O(n²) search space created by the combination of peers × peer ids, but the runtime grows with O(n): while each peer can only be scanned for a limited number of peer ids at once, we can test all peers in parallel. The limit on parallel tests against a single peer stems from the limited number of connections the peer will accept before dropping incoming connections, which would produce a lot of false positives (see Section 2.2.2).

[Figure 2.2: A sample scan using a swarm on PlanetLab]

2.2 Challenges

Although the test using the small PlanetLab swarm proved that the method works, it does not translate directly to real world swarms. In this section we enumerate the problems we encountered while moving from the PlanetLab swarm to real world swarms. The problems fall into three groups:

• Scan complexity

• Error detection

• Invisible portions of the swarm

2.2.1 Scan complexity

Most importantly, the scanning takes time to complete. This might be negligible in a small swarm, but the number of tests to execute grows with O(n²). This means that we will never have a consistent picture of the swarm in its entirety, because we cannot assume that the topology does not change during the scan. To get a view of the topology that is as consistent as possible, we have to make the scan as fast as possible. We therefore end up with a trade-off between consistency and completeness.

To know what scanning time is acceptable, we instrumented BitThief [7] to collect connection lifetimes while participating in 11 swarms. After filtering out connections that were not used to transfer pieces of the torrent, we end up with the cumulative distribution function (CDF) in Figure 2.3. From the distribution we can see that 85% of the connections stay open for longer than 400 seconds.

[Figure 2.3: Cumulative distribution function of the connection lifetime.]

The number of tests grows with O(n²), but the time needed to complete the scan grows with O(n), since we can parallelize testing peers and are left with iterating over the discovered peer ids. We would therefore expect the network coverage to decrease linearly with increasing swarm size.

2.2.2 Error detection

If a handshake completes, that is, we get the handshake response from the peer, we can be sure that the connection we tested for does not exist. This means we cannot have false negatives. The converse does not hold: an aborted handshake does not imply an existing connection, meaning that false positives are possible and, as we will see later, happen quite frequently. Causes for an aborted handshake include:

• Peers with full connection pools: peers limit the number of connections both per torrent and globally. uTorrent and Azureus both use a per-torrent connection limit of 50 and a global connection limit of 200. Once the limit is reached, any new connection is dropped immediately.
• Blocking peers: a peer might decide to block us while we scan, either as a reaction to our behavior or because we use university-owned IP addresses. One example is the Azureus SafePeer plugin, which loads the PeerGuardian blocklists.

• Torrent removed: as soon as a torrent has been removed, a peer starts dropping connections related to that torrent.

These cases are particularly unpleasant, since we are able to establish a connection that is only dropped later, because the blocking happens at the application layer and not at the transport layer. As a first step towards filtering out false positives we have to distinguish connections dropped in reaction to our part of the handshake from connections dropped for other reasons. By delaying our handshake by a few milliseconds we can discover when a peer has started to block us: we ignore all tests where the connection was dropped too soon to be a reaction to our handshake. Should a peer block us from the beginning, we will never get a handshake from it and can therefore drop all tests involving that peer.

Most false positives, however, result from the per-torrent pool being full and, to a lesser degree, from the torrent being removed. In these cases the peer needs the information from our handshake, so the observable behavior is identical to a positive result. To estimate the number of false positives in these cases we introduced some random decoy peer ids that were not used by any peer in the swarm and should therefore produce only negatives. By comparing the number of positive results to the total number of tests scheduled for the decoy peer ids, we can estimate the ratio of false positives during our scan. The decoy peer ids can easily be filtered out during the analysis. Additionally, we decided to reschedule tests that resulted in a positive, in order to retest them later: the probability of a test producing multiple false positives when retested decreases exponentially. While rescheduling quickly reduces the number of false positives, it might also lead us to wrongly revoke real connections later during the scan, because they were closed in the meantime.

Finally, we had a problem with Azureus reporting only negatives. Azureus checks during a handshake whether the advertised capabilities (the reserved bitfield in the handshake) match those it expects for the peer id. If they do not match, it internally prefixes the peer id with the string "FAKE ", thus not producing a collision with the real peer id. This was solved by tracking the reserved bits for each peer id and by implementing the protocol extensions AZMP and LTEX.
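The decoy-based estimate described above is a one-liner; a sketch, assuming scan results are available as (peer id, outcome) pairs (our own representation):

    def false_positive_ratio(results, decoy_ids):
        """Estimate the scan's false positive ratio.  `results` is an
        iterable of (peer_id, positive) pairs; `decoy_ids` are peer ids we
        invented ourselves, so every positive against them must be false."""
        decoy_tests = [positive for peer_id, positive in results
                       if peer_id in decoy_ids]
        return sum(decoy_tests) / len(decoy_tests)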
2.2.3 Invisible portion of the swarm

There are some parts of the swarm we will not be able to scan. A peer might never accept a connection from us, either because it has a full connection pool or because it is blocking us, in which case we cannot make any direct claim about which other peers it is connected to. While this is unfortunate, the resulting false positives are relatively easy to filter out during the final analysis, by removing all positive results coming from peers that never produced a negative result. Adding to the invisible part of the swarm are peers behind NATs or firewalls that are not reachable from outside.

Besides discovering the connections of a peer directly by scanning the peer itself, we can also discover its connections indirectly by scanning for its peer id at the other peers in the swarm. Indirect scanning allows us to detect connections of peers that cannot be scanned directly, and it also helps eliminate false positives, since each connection is tested at both endpoints. To discover the peer ids of peers behind a NAT or firewall we listen for incoming connections: if a NATed peer connects to us, it sends us its peer id, which we add to the set of known peer ids and include in the scan.

2.3 Implementation

We decided to create a minimal BitTorrent protocol implementation instead of refactoring the BitThief client, because it would have been difficult to instrument the rather complex client for our specific needs. Our implementation of the BitTorrent protocol is written in Python and based on the Twisted framework (http://twistedmatrix.com/). It supports the entire BitTorrent protocol but does not implement the logic to retrieve pieces and reconstruct the shared content, since that logic is not needed for our scans. Besides the original BitTorrent protocol we also implemented parts of the Azureus Messaging Protocol and the LibTorrent extension protocol. The protocol extensions are needed because we want to mimic the behavior of the peers we are impersonating, and because the original protocol does not include a peer exchange mechanism, which is essential in the last part of this project.

The scanner is divided into two separate parts:

• A single gatherer that initiates the scan, collects the announce responses, enumerates the tests, schedules tests for execution and collects the results.

• Multiple hunters running on PlanetLab nodes that announce to the tracker, execute assigned tests and return the results to the gatherer.

The gatherer initiates the scan by loading a torrent file, parsing its contents and distributing it to the hunters. The hunters are processes that run on the PlanetLab nodes. They create a listening BitTorrent client for each torrent they are issued, announce to the trackers and get assigned tests by the gatherer. A test consists of a tuple containing the handshake information to be tested and the address of the peer to test it with. Upon receiving a task, the hunter opens a new connection to the address in the task and attempts a handshake with the information from the task. A test may result in a connection failure (connection timeout or connection refused), a completed handshake or a rejected handshake. The result is sent to the gatherer without further interpretation on the hunter side.

Unlike our sample swarm, which used a tracker that returned the peer id along with the peer address, with most trackers we have to learn the peer ids from successful handshakes. The gatherer builds the search space stepwise as it learns new peer ids or new peer addresses. For each peer a fixed number of tasks is assigned at any time; the number of concurrent tests for a single peer is called the concurrency. With too many concurrent connection attempts we might produce a lot of false positives by filling up the connection pool at the peer; the advantage of a higher concurrency is a far greater coverage of the network.

The gatherer collects peers from the announcements and from PEX messages, and together with the peer ids collected from successful handshakes it enumerates all possible tests in the search space and places them in a task queue.
The search space is extended in one direction when a new peer id is discovered and in the other when a new peer address is discovered. Each test consists of a peer id, the corresponding reserved flags to mimic the original behavior, and the address of the peer the test has to be run against. From the task queue the gatherer then assigns tests to the hunters.

[Figure 2.4: Architecture of our implementation: the gatherer (green) connects to the hunters (blue) on the PlanetLab nodes. The hunters contact the peers in the swarm (red) to execute the tests and report the results back to the gatherer.]

The hunters do not evaluate any results or act on them; they just execute the tests the gatherer assigns to them. Initially we attempted to partition the search space to distribute the work of enumerating and scheduling the tests, but we later decided to centralize these tasks at the gatherer to limit concurrency. Having the gatherer take care of the scheduling creates longer than necessary round trip times, but we felt it was necessary in order to keep the number of open connections to the peers under control. Besides limiting the concurrency, a centralized scheduler makes the hunters simpler and allows for more flexible task assignment: we can take different hunter speeds into account and reschedule tests should the connection to a hunter be lost. To eliminate false positives, a test is rescheduled for re-execution at a later time if it results in either a connection failure or a positive.

There are two separate schedulers, one choosing the next task to be executed against a peer in the swarm and one choosing the hunter node that gets the task. For the task queue we used either a simple FIFO queue or a priority queue in which untested connections are given higher priority. The priority queue should allow us to get a higher coverage of the network at the expense of discovering false positives. During our scans we found that the performance of both queues was nearly identical, so we ended up using only the priority queue.

The schedule of which hunter gets the next test turned out to be important. Initially we used a round robin scheduler to distribute the tests, meaning that each hunter would get the same number of tests to execute, independently of how many tasks were currently pending for it. In some cases a slow hunter, on a busy PlanetLab node or with high latency, would end up with all pending tests assigned to it while all the other hunters sat idle until it returned some results. We later implemented a least load scheduler, which picks the hunter with the fewest pending assigned tests and greatly improved the scan speed.

A single hunter can participate in many scans concurrently: a gatherer connecting to a hunter causes a new BitTorrent client to be spawned at the hunter. This keeps the state of different scans separated, which in turn allows us to multiplex multiple scans on the same hunters. The gatherer saves all results into a JSON (http://www.json.org/) formatted log file, and the evaluation of the scans is done afterwards by a separate script. This allows us to modify the analysis without rerunning the scan, and it keeps the gatherer simple.
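A sketch of the two scheduling decisions, with hypothetical hunter names and task data (the priority values and the dispatch loop are our own illustration):

    import heapq

    def pick_hunter(pending):
        """Least-load scheduling: pick the hunter with the fewest
        outstanding tests, instead of handing tests out round robin."""
        return min(pending, key=pending.get)

    # Task queue: untested pairs get priority 0, retests priority 1.
    tasks = [(0, ("peer-id-a", ("203.0.113.7", 51413))),
             (1, ("peer-id-b", ("198.51.100.2", 6881)))]
    heapq.heapify(tasks)

    pending = {"hunter-1": 4, "hunter-2": 1}   # outstanding tests per hunter
    while tasks:
        _, test = heapq.heappop(tasks)
        hunter = pick_hunter(pending)
        pending[hunter] += 1   # dispatch `test` to `hunter`; the count is
                               # decremented again when its result arrives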
2.4 Evaluation

Using the implementation described in Section 2.3 we are able to scan live swarms. To evaluate the performance of our method we start with a bird's-eye view of the swarms and investigate what part of them is observable in a best case scenario, then analyze how well we cover the observable part of a swarm, and finally discuss the precision of our findings.

On average, 42.23% of the IPs returned by the trackers were reachable during our tests; the remaining 57.76% of the peers either rejected connections or timed out, and are therefore either behind a NAT or a firewall, or left the swarm before we contacted them. If we assume that connections between NATed peers are just as likely as connections between non-NATed peers, we would only be able to detect 100% − (57.76%)² = 66.63% of the connections in a swarm. We argue, however, that establishing a connection between two NATed peers is far less likely. The only option for two firewalled peers to establish a connection is NAT holepunching [5]. uTorrent version 1.9 introduced the UDP based uTP, which also supports NAT holepunching, but according to an aggregate statistic of two of the largest trackers (http://ezrss.it/stats/all/index.html) only 8% of all uTorrent peers run a client version with NAT holepunching support. We may therefore assume that we detect a much larger part of the network and that 66.63% is a lower bound.

14.43% of the peers accepting incoming connections never completed a handshake and are therefore in the part of the swarm we were unable to scan. The probability of a connection going undetected because it was established between two such unscannable peers is (14.43%)² = 2.08%. Combining the unreachable peers with the peers that did not allow us to scan them, we can compute a lower limit for the part of the network we are able to scan: 66.63% − 2.08% = 64.55%.

To test what percentage of the network we are able to cover, we ran several scans and compared the swarm size to the coverage we achieved. The swarm size here is the number of unique peer ids. To calculate the coverage we used the following formula:

coverage = executed tests / ((#peer ids − 1) × #peers)
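The bounds above and the coverage formula are easy to restate in code (a sketch, with the measured fractions hard-coded):

    unreachable = 0.5776   # fraction of announced IPs we could not reach
    unscannable = 0.1443   # reachable peers that never completed a handshake

    # A connection is invisible if both endpoints are unreachable, and
    # undetectable if both endpoints refuse to be scanned.
    visible = 1 - unreachable ** 2             # 0.6663
    lower_bound = visible - unscannable ** 2   # 0.6455

    def coverage(executed_tests, n_peer_ids, n_peers):
        """Fraction of the search space actually tested: each peer can be
        probed for every known peer id except its own."""
        return executed_tests / ((n_peer_ids - 1) * n_peers)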
To see whether a higher concurrency allows for higher coverage, we ran several scans with varying concurrency levels and varying swarm sizes.

[Figure 2.5: Network coverage by swarm size for various concurrency settings]

As expected, the coverage of the network decreases linearly with the size of the network being scanned, despite the number of tests growing quadratically. Churn is the main reason for the low coverage: peers leaving the swarm make parts of the search space unscannable, while newly joining peers extend the search space. Peers joining at a later time are added to the search space, but our scanner has less time to complete the scan on them. Additionally, peers vary greatly in response times, which we measure as the time from a task being scheduled to its result arriving at the gatherer. Some peers have an average response time of just under 30 seconds (the connection timeout on the hunters), which would allow us to test just 13 peer ids in 400 seconds.

The reason for the small difference between the concurrency levels is that the coverage quickly converges to its final level. To show this we ran 5 scans with different concurrency levels. We then parsed the logs with increasing timeframes to simulate shorter scans, ending up with 5 sets of 50 scans in 10 second increments. Plotting the coverage in relation to the time gives Figure 2.6. For higher concurrency levels the scan converges faster towards the final coverage, which also explains why prioritizing untested connections with a priority queue does not differ as much as expected from a FIFO queue. The occasional drops in coverage are due to newly discovered peer ids, which enlarge the search space and therefore lower the coverage.

[Figure 2.6: Evolution of the coverage in a 260 peer swarm over time for different concurrency levels.]

Finally, we have to quantify the errors introduced by false positives. To detect false positives we rescheduled tests with a positive result to test them again later. While retesting allows us to eliminate false positives, it also causes us to lose real connections that were correctly identified during the first test and closed in the meantime. To quantify how many connections we lose during retesting and how effective retesting is at identifying false positives, we ran a scan over a longer period of 1000 seconds and then parsed the log with increasing timeframes to simulate shorter scans, ending up with 100 scans in 10 second increments. For each scan we calculated the number of connections (tests that produced a positive at first) and the number of revoked connections (tests that resulted in a positive but were negative during retesting). Calculating the ratio of revoked connections to detected connections gives Figure 2.7.

[Figure 2.7: Ratio of revoked connections in relation to the scan duration. The swarm scanned for this purpose had 303 distinct peer ids and a total of 451 distinct IP addresses. The red line is an approximation of the good connections being revoked due to retesting.]

We observe that after 400 seconds the revocation rate of detected connections quickly becomes linear. In this linear part of the plot the revocation rate due to false positives has become negligible, and the increase in revoked connections is due to real connections being closed, causing the subsequent retests to produce negatives. Hence we can extrapolate the number of real connections in the swarm at the time we began the scan (the red line).
In this particular swarm the ratio of revoked connections after 1000 seconds was 0.563. Assuming a linear rate of good connections being revoked by retesting, we end up with a theoretical false positive rate of 0.505 at 400 seconds, of which our scan detected 0.479. This leads us to the conclusion that (0.505 − 0.479)/0.505 ≈ 5.15% of the connections we detected in this particular swarm are false positives. For smaller swarms the revocation rate converges faster, because most of the time is spent retesting positive results.

Chapter 3: Improving Swarm Topologies

In the final part of this thesis we try to improve the swarm topology by suggesting geographically closer neighbors. To suggest the neighbors we use the PEX mechanism introduced by the Azureus Messaging Protocol extension and the LibTorrent extension protocol.

3.1 Geographical distance

In the absence of other mechanisms to estimate the locality of a possible connection, we have to rely on the geographical distance. Both autonomous system (AS) path length and link hops have been used extensively throughout studies on swarm locality. Link hops, or routing hops, measure the physical route a packet takes, each router on the path counting as one hop. The AS path length, on the other hand, directly relates to the number of ISPs involved in routing a packet. Only the two endpoints of a potential connection can measure the AS path length or the link hops; it is not possible to measure them from outside. If there were a direct correlation between the geographical distance and the AS path length or link hop count, we could estimate the distance between two endpoints without involving them directly and, more importantly, without any active measurements such as traceroutes, pings or latency measurements. We could estimate the locality of a connection offline.

While it seems intuitive that there is a correlation between the geographical distance and the routing distance, we have to prove it first. To do so we extracted the IPs of the peers from several swarms and ran traceroutes from several PlanetLab nodes, from which we were able to identify the number of link hops and AS hops. We measured 559 distinct IP addresses from 100 PlanetLab nodes. After dropping routes that did not produce meaningful results, because we could not identify one of the link hops, we are left with 21330 distinct measurements. Figure 3.1 plots the relation of AS path length and link hops to the geographical distance. To calculate the distances we used the MaxMind IP Geolocation library to determine the approximate coordinates of each IP, and the Haversine formula.

[Figure 3.1: Comparison of geographical distance to AS hops and routing hops]
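For reference, a small sketch of the Haversine great-circle distance (coordinates in degrees, result in kilometres; the example coordinates are our own):

    from math import asin, cos, radians, sin, sqrt

    EARTH_RADIUS_KM = 6371.0

    def haversine(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points in degrees."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

    # e.g. Zurich to New York, roughly 6300 km:
    print(haversine(47.37, 8.54, 40.71, -74.01))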
There is indeed a medium correlation between the geographical distance and the AS path length, with a Pearson correlation of 0.351, and between the geographical distance and the number of link hops, with a Pearson correlation of 0.417. The geographical distance is thus a good first approximation of the locality as measured with other metrics such as AS path length or link hops.

3.2 Are BitTorrent swarms locality-aware?

Although many proposals have been made over the years on how to improve the locality of BitTorrent swarms, none have been widely adopted. Using the method described in Section 2.1 we can verify that there is no large scale use of topology improving strategies. To do so we compare the average distance between neighbors, called the connection length, to the expected length of a randomly picked connection. We would expect that:

locality ratio = connection length / E[connection length] ≈ 1

The connection length is a discrete random variable. If a peer randomly chooses a neighbor from the set of known peers, the expected connection length is the average distance to all possible neighbors. Since each peer chooses its connections independently, we can assume that the average length of all connections in the final topology is also a random variable, with expected value equal to the average length of all connections in the complete graph. We reuse the geographical data collected to create the maps and calculate the distance between each pair of peers in the network. We then create the weighted undirected graph from the scan results of Section 2.4, with the peers as vertices and the connections as edges. The geographical distance, or connection length, is used as edge weight.

[Figure 3.2: Ratio of actual connection length to expected connection length]

If there were any large scale use of locality improving strategies, it would have considerably lowered the average AS path length or number of link hops, and therefore indirectly the geographic distance between connection endpoints, and the locality ratio would be lower than 1. The cumulative distribution function in Figure 3.2, however, shows that the average connection length is close to the expected connection length. The average ratio of actual to expected connection length is 1.0250 with a standard deviation of 0.0579, which leads us to the conclusion that the connections are indeed randomly chosen.
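A sketch of the locality ratio computation, reusing the haversine helper from above (the input structures, a coordinate map and an edge list, are our own representation):

    from itertools import combinations

    def mean_length(coords, pairs):
        """Average great-circle length over the given peer pairs."""
        pairs = list(pairs)
        return sum(haversine(*coords[a], *coords[b])
                   for a, b in pairs) / len(pairs)

    def locality_ratio(coords, edges):
        """coords: peer -> (lat, lon); edges: observed (peer, peer) connections.
        Ratio of the observed mean connection length to the mean over the
        complete graph, i.e. the expected length of a random connection."""
        return (mean_length(coords, edges)
                / mean_length(coords, combinations(coords, 2)))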
3.3 Why PEX?

Many different methods for improving swarm locality have been proposed over the years. Most of the proposed mechanisms involve improvements at the clients, based on measurements such as traceroutes or round trip times, in order to minimize the number of network hops or the AS path length. As our measurements indicate, none of these approaches have actually been deployed at a large scale, mainly due to the missing incentives at the user level: there is no direct advantage for the end user in creating a locality-aware swarm.

During the remainder of this chapter we therefore assume the viewpoint of an ISP. The ISP is interested in reducing its peering cost and bandwidth consumption by improving the locality of a swarm, but it does not have access to the software at the peers and may only influence the topology indirectly. Other studies have attempted an ISP-centric approach in which the ISP intercepts and modifies the communication between peers or between a peer and the tracker. Options for an ISP to influence a swarm's topology include:

• Tracker interception, as proposed by Bindal et al. in [1].

• BitTorrent blocking or throttling: Dischinger et al. show in [3] that some ISPs (most notably Comcast) started throttling or blocking BitTorrent, prompting the creation of uTP to lower congestion and the introduction of protocol obfuscation to evade throttling.

• Strategic placement of local peers: Tian et al. [13] propose that the ISP actively participate in the sharing of the content and thereby provide a preferred peer for the peers in its network.

• Suggesting neighbors via PEX: use a large view of the network and suggest geographically close neighbors by connecting to the peers and sending a PEX message containing the suggestions.

• Impersonation: similar to our scanning method from the first part, the ISP would connect to its local peers authenticating with the peer id of a remote peer. The goal is to keep the local peer from contacting the remote peer by provoking a peer id collision. While this works for small swarms and only a few peers in the ISP's network, it involves a large overhead, because the connections to the local peers cannot be closed.

Impersonation, traffic shaping and any interception attempt are likely to be frowned upon by the BitTorrent community, since they impose the will of the ISP on the end user. Strategic placement of ISP-owned peers might be a good solution, but it binds a lot of resources at the ISP, which would have to actively participate in a multitude of swarms, and it would make the ISP vulnerable to intellectual property infringement claims if it participates in swarms distributing copyrighted material. We see suggesting better peers via PEX as the only acceptable way of influencing the swarm topology, because it does not force the peer; it simply broadens the view the peer has of the swarm. A broader view of the swarm is valuable even if the peer already implements a locality improving strategy. We can suggest multiple peers at once, but the protocol limits us to 50 per PEX message. The effectiveness of our suggestions depends on how many other peers the target peer already knows: assuming the peer establishes connections to peers in its list of known peers at random, the probability that it picks one we suggested is inversely proportional to the number of peers it knows.

3.4 Implementation with PEX

Our implementation uses PEX messages to suggest geographically close peers. For each suggestion we open a new connection, complete the handshake, send the PEX message with our suggested peers and close the connection after a short timeout. If the peer supports neither of the two PEX implementations, we close the connection immediately. We decided against keeping connections to the peers open: we want to avoid the overhead of keeping connections alive, we do not want to occupy a slot in the connection pool at the peers, and bigger swarms would require us to distribute the task of keeping connections open.
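A sketch of how such a suggestion message can be assembled. The framing follows the LibTorrent extension protocol (extended messages use BitTorrent message id 20); `ut_pex_id` is the message id the remote peer advertised for PEX in its extension handshake, and the helper names are our own:

    import socket
    import struct

    def bencode(value):
        """Minimal bencoder for the payload dictionary."""
        if isinstance(value, bytes):
            return str(len(value)).encode() + b":" + value
        if isinstance(value, int):
            return b"i" + str(value).encode() + b"e"
        if isinstance(value, dict):
            return b"d" + b"".join(bencode(k) + bencode(v)
                                   for k, v in sorted(value.items())) + b"e"
        raise TypeError(type(value))

    def compact_peers(peers):
        """Pack (ip, port) pairs into the 6-bytes-per-peer compact format."""
        return b"".join(socket.inet_aton(ip) + struct.pack(">H", port)
                        for ip, port in peers)

    def pex_suggestion(ut_pex_id, peers):
        """Frame a PEX message advertising `peers` (at most 50) as newly
        'added' peers: <length><20 = extended><ut_pex_id><payload>."""
        payload = bencode({b"added": compact_peers(peers),
                           b"added.f": b"\x00" * len(peers)})
        body = bytes([20, ut_pex_id]) + payload
        return struct.pack(">I", len(body)) + body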
A first implementation used a one-shot approach to improve the swarm topology: perform a scan, attempt to improve the topology, and perform another scan to verify that the attempt was successful. This implementation uses the log of a previous scan of the swarm to discover peers, calculates the suggestions offline and only then contacts the peers to deliver them. The one-shot implementation suffers from two drawbacks: it makes suggestions based on stale information, namely the peer information from a previous scan, and, more importantly, it suggests better neighbors to peers that have been in the swarm for some time, already know quite a few other peers and have filled a good portion of their connection pool. Such peers are unlikely to follow our suggestions, and while testing this implementation we were indeed unable to detect any effect.

The second implementation, called the continuous suggester, uses the responses from announcing to the trackers as its source of peers and keeps announcing to discover new peers. The idea is that newly joined peers are more susceptible to our suggestions, because they know only very few peers, most of which will not be reachable (see Section 2.4), increasing the probability that they will choose one of our suggested peers when establishing new connections. Clearly, the earlier we discover a new peer, the better the chances that it acts on our suggestion; we therefore reused the hunters from our scanner implementation to initiate an announce every 15 seconds. The suggester keeps track of the peers seen before, so when receiving an announce response it can detect the new peers. When detecting a new peer we look up the geographic coordinates of its IP and query the list of known peers for the 50 geographically closest peers. Once the 50 closest peers to our query point are known, we open a connection to the newly joined peer and send it a PEX message containing our suggestions.

Besides contacting the newly joined peer, we also contact the peers we are suggesting and send them a PEX message containing the newly joined peer. This way we suggest the connection to both endpoints, and we can make suggestions even if the joining peer does not support PEX. Additionally, the suggestions to the other peers are used to detect peers that have left the swarm or become unreachable, so that they are not included in future suggestions. The suggestions to already known peers are delayed for a few seconds. The delay is used to group suggestions going to the same host and reduce the overhead; most suggestions would otherwise open a new connection for only a single suggested peer.

Currently, querying the list of known peers involves calculating the distance between the query point and every peer the suggester knows, making it slow in large swarms with high churn. The closest neighbor lookup could be sped up by mapping the peers onto a space filling curve, potentially reducing the lookup time from O(n) to O(log n).
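The linear-scan lookup itself is tiny (a sketch reusing the haversine helper from Section 3.1; this k-closest query is exactly what a space filling curve index would accelerate):

    import heapq

    def closest_peers(known, query, k=50):
        """Return the k peers geographically closest to `query`.
        known: peer -> (lat, lon); query: (lat, lon) of the new peer."""
        return heapq.nsmallest(k, known,
                               key=lambda p: haversine(*known[p], *query))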
3.5 Results

Our goal is to reduce the average connection length in a swarm by suggesting geographically close neighbors to newly joining peers. This would show up as a shift to the left in the cumulative distribution function of Figure 3.2. With the one-shot implementation we would have been able to directly compare the locality ratio of a swarm before and after the suggestions. With the continuous suggester this is no longer possible, because we scan completely different swarms.

A single experiment consists of downloading 2 torrents from The Pirate Bay (http://thepiratebay.se), selecting one of them at random to be influenced by suggesting peers and using the other as the control case. Our method should be particularly effective on swarms with many newly joining peers that know only few other peers, because this increases the chance of one of our suggestions resulting in an actual connection. We therefore preferred torrents that had just been uploaded and swarms that had only just started. We started the continuous suggester on the influenced swarm, and after 1 hour we scanned both the influenced and the control swarm.

We influenced a total of 20 swarms with 15133 suggestions to newly joined peers. On average the suggestions had a connection length of 867.90 km, compared to an expected connection length of 6796.50 km in the swarms. In 408 cases we were even able to suggest all 50 peers in the same AS as the joining peer.

[Figure 3.3: Histogram of average suggested connection lengths]

Of all peers contacted to receive a suggestion, either because they joined or because they were a close neighbor of a joining peer, 93.50% supported PEX and successfully received the suggestion. Comparing the locality ratios between the influenced swarms and the control swarms gives Figure 3.4. The influenced swarms have an average locality ratio of 0.963, a 6.05% improvement over the uninfluenced swarms.

[Figure 3.4: Ratio of actual connection length to expected connection length. The influenced swarms (green) have a significantly lower locality ratio than the control swarms (red).]

The resulting bandwidth savings depend heavily on the swarm, but assuming that a peer joins the swarm and we successfully suggest a peer using the same ISP, we reduce the traffic across the ISP's boundary by 1/50th of the total shared file size, assuming that each connection contributes equally to the download. Each suggestion requires establishing 51 connections with handshakes (68 bytes each), followed by an LTEX or AZMP handshake (about 200 bytes) and the PEX message (50 · 6 = 300 bytes for the suggestion to the newly joined peer, and 6 bytes for each of the suggested peers), creating a total overhead of about 14 KB per suggestion. The tracker announces cause an additional 20 bytes/s of traffic per swarm, which is amortized over all suggestions. An ISP may obtain some of this information by passively observing the traffic on its network: it does not have to announce itself to learn that a peer in its network is joining a swarm, and it can combine the tracker responses from the announces of multiple peers without ever announcing itself.
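The per-suggestion overhead figure is straightforward to verify (a sketch using the message sizes quoted above):

    # Per-suggestion overhead, one 50-peer PEX to the joining peer and one
    # 1-peer PEX to each of the 50 suggested peers.
    handshake, ext_handshake = 68, 200
    to_new_peer  = handshake + ext_handshake + 50 * 6    # 568 bytes
    to_suggested = 50 * (handshake + ext_handshake + 6)  # 13700 bytes
    print(to_new_peer + to_suggested)                    # 14268 bytes, ~14 KB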
It should be noted that although we chose the geographical distance as the metric for our suggestions, the suggestion method is not limited to it. An ISP can choose any suggestion policy to influence the swarm. For example, if the ISP owns several widespread autonomous systems or has a favorable peering agreement with another provider, it might prefer peers from those networks and suggest accordingly.

Chapter 4: Conclusion

In this thesis we developed a method to detect the network topology of BitTorrent swarms without any privileged access to the peers, just by knowing their identities and attempting to complete handshakes. We then analyzed the performance of our method and discussed the quality of the results it gathers.

In the second part we used our method to check for large scale use of locality improving strategies. We first established that the geographical distance between peers is a good approximation of the underlying network topology distance. We then compared the geographical distance of connection endpoints to what would be expected if they were chosen at random, and showed that the connections are indeed opened at random. Finally, we influenced the topology of swarms by suggesting closer neighbors to newly joining peers and verified the effect with the measurement method from the first part. When we start influencing early during the creation of a swarm, we can reduce the average connection length considerably. Even in small swarms of 100 peers the summed distance savings exceed the distance between the Earth and the Moon.

Bibliography

[1] R. Bindal, P. Cao, W. Chan, J. Medved, G. Suwala, T. Bates, and A. Zhang. Improving traffic locality in BitTorrent via biased neighbor selection. In Distributed Computing Systems, 2006. ICDCS 2006. 26th IEEE International Conference on, pages 66–66. IEEE, 2006.

[2] D.R. Choffnes and F.E. Bustamante. Taming the torrent: a practical approach to reducing cross-ISP traffic in peer-to-peer systems. In ACM SIGCOMM Computer Communication Review, volume 38, pages 363–374. ACM, 2008.

[3] M. Dischinger, A. Mislove, A. Haeberlen, and K.P. Gummadi. Detecting BitTorrent blocking. In Proceedings of the 8th ACM SIGCOMM Conference on Internet Measurement, pages 3–8. ACM, 2008.

[4] A.H.T.M.D. Fauzie, A.H. Thamrin, R. Van Meter, and J. Murai. Capturing the dynamic properties of BitTorrent overlays. AsiaFI Summer School, 2010.

[5] B. Ford, P. Srisuresh, and D. Kegel. Peer-to-peer communication across network address translators. In USENIX Annual Technical Conference, volume 2005, 2005.

[6] T. Karagiannis, P. Rodriguez, and K. Papagiannaki. Should Internet service providers fear peer-assisted content distribution? In Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, pages 6–6. USENIX Association, 2005.

[7] T. Locher, P. Moor, S. Schmid, and R. Wattenhofer. Free riding in BitTorrent is cheap. In Proc. Workshop on Hot Topics in Networks (HotNets), pages 85–90. Citeseer, 2006.

[8] P. Maymounkov and D. Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. Peer-to-Peer Systems, pages 53–65, 2002.

[9] R.L. Pereira, T. Vazão, and R. Rodrigues. Adaptive search radius: lowering Internet P2P file-sharing traffic through self-restraint. In Network Computing and Applications, 2007. NCA 2007. Sixth IEEE International Symposium on, pages 253–256. IEEE, 2007.
[10] S. Ren, E. Tan, T. Luo, S. Chen, L. Guo, and X. Zhang. TopBT: A topology-aware and infrastructure-independent BitTorrent client. In INFOCOM, 2010 Proceedings IEEE, pages 1–9, March 2010.

[11] H. Schulz and K. Mochalski. Internet study 2008/2009, 2009.

[12] M. Sirivianos, J.H. Park, R. Chen, and X. Yang. Free-riding in BitTorrent networks with the large view exploit. In Proc. of IPTPS, volume 7, 2007.

[13] C. Tian, X. Liu, H. Jiang, W. Liu, and Y. Wang. Improving BitTorrent traffic performance by exploiting geographic locality. In Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008, pages 1–5. IEEE, 2008.

[14] H. Xie, Y.R. Yang, A. Krishnamurthy, Y.G. Liu, and A. Silberschatz. P4P: provider portal for applications. In ACM SIGCOMM Computer Communication Review, volume 38, pages 351–362. ACM, 2008.

Appendix A: Working with PlanetLab

PlanetLab is an academic network specifically aimed at providing realistic conditions for testing network related projects. After registering for an account, the user gets access to the control interface, which allows adding SSH keys to the allowed list and adding hosts to the user's slice. A slice is a collection of virtual machines running on the PlanetLab machines. The virtual machines have a mix of Fedora 8 and Fedora 14 installed, which can cause incompatibilities due to different software versions; in our case one major problem was the Python versions, which were either 2.3 or 2.5. The user gets root access via the sudo command, allowing software to be installed as needed, but a distribution upgrade to homogenize the environment is not possible and renders the virtual machine unusable.

Sudo access does not require a password, but on the machines running Fedora 8 a TTY device is required. While this is not a problem when logging in via SSH and then executing a command, it does fail if the command is executed directly, like this:

    ssh -l ethzple_BitThief planetlab.hostna.me sudo uptime

To get sudo working without a TTY we have to log in once with a TTY and execute the following:

    sudo sed -i \
      's/ethzple_BitThief\tALL=(ALL)\tALL/ethzple_BitThief ALL=(ALL) NOPASSWD: ALL/g;' \
      /etc/sudoers

This sets the NOPASSWD option for the user ethzple_BitThief, which then allows a TTY-less sudo. This is especially important when executing sudo commands via pssh.

At first we used cssh (http://sourceforge.net/projects/clusterssh/) to open multiple SSH connections and run commands on the PlanetLab nodes. While cssh is good for a small number of hosts, we switched to pssh (http://www.theether.org/pssh/) once we got beyond 20 hosts, because cssh keeps an xterm open for each host it is connected to. pssh executes a single command in parallel. The downside of pssh is that it executes just that single command, does not allow interaction with the program, and shows the program's output only after the command has finished and the connection has been closed. cssh is to be preferred for small clusters, while pssh is great for executing a single command on a lot of machines in parallel.

PlanetLab nodes change their SSH identity from time to time, and when they do, connections to them yield an error. If security is not an issue, it is convenient to disable host identity checking (StrictHostKeyChecking=no) and to avoid adding PlanetLab node SSH keys to the list of known keys (UserKnownHostsFile=/dev/null); with plain ssh these options are passed via -o, with pssh via -O.
screen is also a very versatile tool, detaching commands from the shell and re-attaching to them later. It allows programs to run in the background while still being able to interact with them if needed. screen does not work out of the box on PlanetLab nodes; the only solution we found is to install an SSH daemon and let it run on a non-default port:

    sudo yum install --nogpg -y openssh-server
    sudo sed -i 's/#Port 22/Port 64444/' /etc/ssh/sshd_config
    sudo /etc/init.d/sshd restart

This installs the SSH daemon on the virtual machine, reconfigures it to listen on port 64444 and restarts it. Besides allowing us to use screen, connecting through this daemon also eliminates the problem of the changing SSH host keys.