Computing and Finding a Minimum Bottleneck
Computing and Finding a Minimum Bottleneck Spanning Tree in Parallel

Ahmad Traboulsi
School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6
[email protected]
April 19, 2015

Abstract

Finding a minimum bottleneck spanning tree consists essentially of finding the minimum bottleneck edge. In this paper I parallelize an approach for finding that edge. The parallelization is on two levels, the cluster level and the cluster node level (CGM cluster). The algorithm is presented along with its evaluation and results. The approach is derived from the reverse-delete algorithm.

1 Introduction

Computing has led people to seek optimal solutions, i.e. to find the best solution to a problem from the set of all feasible solutions. Many such problems require a great deal of processing time and power, especially when they involve big data, and solving them with serial computing is sometimes impractical because of hardware limitations. Nowadays, however, parallel computers with different architectures (multicore, GPUs, clusters) are available, and they allow better performance and efficiency provided the hardware is utilized well through parallel programming.

One of the classical graph optimization problems is the Minimum Spanning Tree (MST) problem. A related problem, which this paper addresses, is the Minimum Bottleneck Spanning Tree (MBST) problem.

Formally, the Minimum Spanning Tree problem is defined as follows. Let G = (V, E) be an undirected connected graph with a cost function w mapping edges to positive real numbers. A spanning tree is an undirected tree connecting all vertices of G. The cost of a spanning tree is the sum of the costs of the edges in the tree. A minimum spanning tree is a spanning tree whose cost is minimum over all possible spanning trees of G.

The Minimum Bottleneck Spanning Tree problem is defined as follows. Let G = (V, E) be an undirected connected graph with a cost function w mapping edges to positive real numbers. The bottleneck edge of a spanning tree is the edge with the maximum cost among all edges of that tree; a spanning tree may have more than one bottleneck edge, in which case they all have the same cost. A spanning tree T is called a minimum bottleneck spanning tree (MBST) if its bottleneck edge cost is minimum among all possible spanning trees.

Some of the applications of MBST and MST ([15]) are:

• Taxonomy.
• Cluster analysis: clustering points in the plane, single-linkage clustering (a method of hierarchical clustering), graph-theoretic clustering, and clustering gene expression data.
• Constructing trees for broadcasting in computer networks. On Ethernet networks this is accomplished by means of the Spanning Tree Protocol.
• Image registration and segmentation.
• Curvilinear feature extraction in computer vision.
• Handwriting recognition of mathematical expressions.
• Circuit design: implementing efficient multiple constant multiplications, as used in finite impulse response filters.
• Regionalisation of socio-geographic areas, the grouping of areas into homogeneous, contiguous regions.
• Comparing ecotoxicology data.
• Topological observability in power systems.
• Measuring homogeneity of two-dimensional materials.
• Minimax process control.

In addition, MBST and MST computations are often a key module in solving more complex graph problems. I present an approach that uses concurrent threads with the reverse-delete algorithm to solve the MBST problem.

2 Literature Review

The literature has no parallel algorithms or implementations for the Minimum Bottleneck Spanning Tree problem. For sequential MBST algorithms I referred to the paper by Camerini [2], which presents one algorithm for finding a minimum bottleneck spanning tree in a weighted undirected graph and another for finding one in a directed graph. A second paper that solves the MBST problem in a directed graph is by Gabow and Tarjan [5]; it presents a new algorithm for finding an MBST in a directed graph and a second algorithm, a modified Dijkstra algorithm, that also finds an MBST in a directed graph. Most of those algorithms are inherently sequential and therefore not good candidates for parallelization. However, Kruskal's original paper includes an algorithm called reverse-delete, which can be used to obtain an MST or an MBST.

On the other hand, there are many papers about finding minimum spanning trees on parallel computing systems with different architectures. Among the three well-known MST algorithms, Kruskal's, Prim's and Boruvka's, the most frequently targeted is the last, because it is naturally parallel, whereas Kruskal's and Prim's algorithms are inherently sequential, which makes them difficult to parallelize. As I have observed, the papers that targeted those two algorithms without combining or modifying them achieved little to no improvement or speedup. In the following I have divided my review into several subsections according to the parallel architecture used.

2.1 MST - GPUs

Three papers were reviewed for this architecture, starting from the current state-of-the-art approach by Vineet et al.
[13], which gives a speedup of 30 to 50 times over a CPU implementation and constructs an MST for a graph with 5 million nodes and 30 million edges in under one second on one quarter of a Tesla S1070 GPU. Their algorithm is based on Boruvka's algorithm and uses scalable primitives such as scan, split and segmented scan, in addition to efficient data-mapping primitives including sort, scan and reduce; it is essentially a recursive approach that uses a series of basic primitives at each step. A second paper, by S. Bressan et al. [10], claims to outperform the state-of-the-art algorithm by Vineet et al. Their approach is based on Prim's algorithm and also uses parallel primitives, namely prefix sum, stream compaction and sorting, as intermediate components. The algorithm lets each processor try to grow a tree using Prim's algorithm; whenever two trees collide, one of the processors hands its tree over to the other and starts building a tree from a new unvisited vertex. The idea is similar to the multicore transactional memory approach to finding an MST by S. Kang and D. Bader [7]. Finally, the third paper, by W. Wang et al. [14], similarly uses Prim's algorithm. However, because it does not parallelize the outer loop, it does not perform well compared to the two previous papers: it does not try to grow multiple trees and only parallelizes the two inner loops, which find the minimum-weight edge and update the candidate edge set.

2.2 MST - Multicores

A fast shared-memory algorithm [1] by David A. Bader and Guojing Cong for computing an MST of a sparse graph gives three variants of Boruvka's algorithm plus a new MST algorithm for symmetric multiprocessors (SMP).
Their best variant of Boruvka's algorithm has a time complexity of $O\left(\frac{m+n}{p} \log n\right)$, and their new MST algorithm has the same worst-case time complexity, $O\left(\frac{m+n}{p} \log n\right)$. The new algorithm is interesting because it combines Prim's and Boruvka's algorithms: the processors start growing trees as in Prim's algorithm, then contract them as in Boruvka's, and repeat the whole procedure recursively. However, the lock-free mechanism used has an excessive overhead. The second paper, by David A. Bader and S. Kang [7], builds on the new MST algorithm of the previous paper. It also targets sparse graphs and reports speedups averaging from 8 up to 16 times. The algorithm grows trees using Prim's algorithm, and a processor stops growing its tree when it touches another tree; in that case processor 1 hands its tree over to processor 2 and starts over from a new unvisited node. The drawback is that this lowers processor utilization and therefore the available parallelism. The third paper in this section, by A. Katsigiannis et al. [8], parallelizes Kruskal's algorithm using helper threads. A main thread proceeds as in the usual Kruskal's algorithm while the helper threads try to reduce the main thread's search space. Each helper thread is assigned a partition of the edge list and keeps looping through its partition, testing whether each edge would form a cycle with the MST edges found so far by the main thread. As soon as the main thread enters a helper thread's partition, that helper thread stops. The authors report a speedup of 5.5 times over the sequential Kruskal's algorithm. The drawback is again that the utilization of threads decreases as the main thread advances.
The current state of the art for multicores is therefore the second paper presented in this section; however, its algorithm requires costly inter-processor communication to merge subtrees when they come into contact.

2.3 MST - Clusters

Parallelizations of both Prim's and Kruskal's algorithms are presented by V. Loncar et al. [12]. Prim's algorithm is parallelized with a master-slave approach in which several processes each find the minimum-weight edge in their own set of edges and vertices, after which the data is collected and the results are processed. This algorithm runs in $O(n^2/p) + O(n \log p)$. The parallelization of Kruskal's algorithm works as follows: partitions of the main graph are assigned to the processors, each of which locally computes an MST using Kruskal's algorithm, and the local MSTs are then merged. The time complexity of this algorithm is $O(n^2/p) + O(n^2 \log p)$. The second paper is based on MapReduce and shows how to achieve a very simple Java implementation of the Minimum Spanning Tree problem in MapReduce [11]. It gives only implementation details; no analysis is provided. Essentially, it uses Kruskal's algorithm as a reducer after partitioning the graph into subgraphs.

2.4 MST - Abstract Machines

Two abstract machines were considered in two papers. F. Dehne and S. Gotz [6] present a Boruvka-based algorithm that computes the MST by having each processing unit find a local MST and then pruning and merging the resulting MSTs into a single one using a D-ary tree on a BSP abstract computer. The second paper, by K. W. Chong et al. [4], gives an optimal logarithmic-time, $O(\log n)$, algorithm on the EREW PRAM abstract computer. It takes $\log n$ steps by using multiple threads working on different parts of the search space; as soon as thread i finishes, the following thread requires only $O(1)$ time to finish, and there are $\log n$ threads, which results in a time complexity of $O(\log n)$ and makes it the state of the art.

2.5 MST - Architecture-Independent Methodology

One paper, by C. da Silva Sousa et al. [9], presents a platform-independent algorithm, a variant of Boruvka's algorithm. The implementation is based on specific design and implementation decisions such as the data representation. It claims to outperform all other existing algorithms; however, no results were shown comparing it with the state-of-the-art algorithms mentioned above. The implementation and the approach are interesting, and from the implementation it appears that it would perform best on a GPU architecture.

The material above relates to the Minimum Spanning Tree problem. The problem I am addressing, however, has not yet been treated in parallel computing. I will therefore read more papers and extend the literature review regarding my approach to Minimum Bottleneck Spanning Trees; in particular, I need to learn more about parallel algorithms for connected components, since they are part of my approach to parallelizing the reverse-delete algorithm.

3 Project Report

In this project I parallelize an approach inspired by the reverse-delete algorithm for computing a minimum bottleneck spanning tree on two levels, the cluster level and the cluster node level. The remainder of this report is organized as follows. Section 3.1 defines the MBST in more detail. Section 3.2 presents the approach for computing a minimum bottleneck spanning tree. Section 3.3 presents the parallelization of the algorithm, with subsection 3.3.1 covering the cluster level and subsection 3.3.2 the cluster node level. Finally, section 3.4 presents the evaluation, with the results illustrated in figures.

3.1 Minimum Bottleneck Spanning Trees

Let G = (V, E) be an undirected connected graph with a cost function w mapping edges to positive real numbers. A spanning tree is a tree connecting all vertices of G.
The bottleneck edge of a spanning tree is the edge with the highest cost among all edges of that tree; a spanning tree may have more than one bottleneck edge, in which case they all have the same cost. A spanning tree T is called a minimum bottleneck spanning tree (MBST) if its bottleneck edge cost is minimum among all possible spanning trees. It is easy to see that a graph may have many MBSTs. For example, consider a graph in which all edge costs are the same: all spanning trees of that graph have the same bottleneck edge cost, and there is no spanning tree with a bottleneck edge cost lower than that of any other spanning tree, so every spanning tree of such a graph is an MBST. The well-known Minimum Spanning Tree (MST) problem is related to the MBST problem: an MST is necessarily an MBST, while the converse is not true. Therefore any algorithm that computes an MST also computes an MBST.

3.2 Reverse-Delete Inspired Approach for Computing an MBST

Reverse-delete is the exact reverse of Kruskal's algorithm. The algorithm sorts the edges in non-decreasing order and then removes edges starting from the edge with maximum weight at index m (see figure 1). If the removal of an edge would disconnect the graph, the edge is kept, and the algorithm proceeds until it has checked the edge at index 1.

Figure 1: Reverse-delete for computing an MST.

To compute an MBST one can proceed as in reverse-delete but stop at the first edge whose removal disconnects the graph (see figure 2), add that edge back to the graph, and finally take any spanning tree of the remaining edges, which will be an MBST. To do this more efficiently, one can search for that first disconnecting edge with a binary-search-like technique.

Figure 2: Computing an MBST.
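
The binary-search variant of section 3.2 can be sketched in Python as follows. This is a minimal sequential sketch under stated assumptions, not the report's implementation: the function names `bfs_tree`, `is_connected` and `mbst`, the 0-based vertex numbering and the `(u, v, w)` edge tuples are chosen here for illustration, and the input graph is assumed to be connected with at least two vertices.

```python
from collections import deque

def bfs_tree(n, edges):
    """Return the edges of a BFS tree rooted at vertex 0 (vertices are 0..n-1)."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    seen = [False] * n
    seen[0] = True
    queue = deque([0])
    tree = []
    while queue:
        u = queue.popleft()
        for v, w in adj[u]:
            if not seen[v]:
                seen[v] = True
                tree.append((u, v, w))
                queue.append(v)
    return tree

def is_connected(n, edges):
    # A graph on n vertices is connected iff its BFS tree has n - 1 edges.
    return len(bfs_tree(n, edges)) == n - 1

def mbst(n, edges):
    """Return (bottleneck_weight, spanning_tree_edges) for a connected graph."""
    sorted_edges = sorted(edges, key=lambda e: e[2])
    lo, hi = 0, len(sorted_edges) - 1
    # Invariant: the prefix sorted_edges[:hi + 1] keeps the graph connected,
    # and every prefix ending before index lo is disconnected.
    while lo < hi:
        mid = (lo + hi) // 2
        if is_connected(n, sorted_edges[:mid + 1]):
            hi = mid          # the bottleneck edge is at mid or earlier
        else:
            lo = mid + 1      # the bottleneck edge is after mid
    # Any spanning tree of the surviving prefix is an MBST.
    return sorted_edges[lo][2], bfs_tree(n, sorted_edges[:lo + 1])
```

Each probe rebuilds the adjacency lists from a slice of the sorted edge list, which keeps the sketch short; it makes no attempt to reuse work between probes.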

3.3 Parallelizing the Computation of an MBST

The approach described above for computing an MBST can be parallelized on a cluster with two levels of parallelization, the cluster level and the cluster node level. The search for the edge is parallelized on the cluster level, while some computations, mainly the connectivity check, are parallelized on the cluster node level. The algorithm is presented below.

Algorithm 1 Parallel Computation of an MBST of Graph G
    Sort the set E of edges in non-decreasing order
    while bottleneck edge is not found do
        FindOut();
        result = PBFS();
        Share and collect the results and the last edge index using Allgather.
        analyse(CollectedDataFromAllgather);
    end while
    Find a spanning tree from the set of edges that have a weight ≤ bottleneck edge weight

Each of the functions above does the following:

FindOut() lets each cluster node determine, according to its rank, the set of edges it is allowed to use in the current round.
PBFS() performs a parallel breadth-first search and returns 1 if the graph is connected or 0 if it is not connected.
analyse() analyzes the collected data and updates the two variables max disconnected edge and min connected edge, which are used in FindOut(); if the bottleneck edge is found, it breaks the while loop.

3.3.1 Parallelization on Cluster Level

On the cluster level each cluster node is assigned a set of edges as shown in figure 3, performs local computations including the connectivity check, and then shares its result with the other cluster nodes by participating in the filling of array L in figure 4.
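
One way to see how the rounds of Algorithm 1 narrow the search is to simulate the p cluster nodes inside a single process. The sketch below is illustrative only and rests on several assumptions: it uses no MPI, a plain list `L` stands in for the Allgather result, a sequential BFS stands in for PBFS, and the probe placement (boundaries of p + 1 roughly equal regions between the two known indices) is a guess at how FindOut() assigns edges to ranks. The names `simulate_rounds` and `prefix_connected` are hypothetical, and the input graph is assumed to be connected with at least two vertices.

```python
from collections import deque

def prefix_connected(n, adj, limit):
    """BFS from vertex 0 using only edges whose sorted index is <= limit."""
    seen = [False] * n
    seen[0] = True
    queue = deque([0])
    reached = 1
    while queue:
        u = queue.popleft()
        for v, idx in adj[u]:
            if idx <= limit and not seen[v]:
                seen[v] = True
                reached += 1
                queue.append(v)
    return reached == n

def simulate_rounds(n, edges, p):
    """Return (bottleneck_edge, rounds), simulating p cluster nodes in one process."""
    sorted_edges = sorted(edges, key=lambda e: e[2])
    adj = [[] for _ in range(n)]
    for idx, (u, v, _w) in enumerate(sorted_edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    max_disconnected = -1                   # sentinel: no edges means disconnected
    min_connected = len(sorted_edges) - 1   # the full graph is assumed connected
    rounds = 0
    while min_connected - max_disconnected > 1:
        span = min_connected - max_disconnected
        # "FindOut": rank r probes the boundary of region r + 1 of p + 1 regions,
        # clamped so that no rank re-tests an index already known to disconnect.
        probes = [max(max_disconnected + 1,
                      max_disconnected + ((r + 1) * span) // (p + 1))
                  for r in range(p)]
        # Each rank runs its connectivity check; L mimics the gathered array.
        L = [1 if prefix_connected(n, adj, k) else 0 for k in probes]
        # "analyse": tighten the two indices from the gathered results.
        for k, bit in zip(probes, L):
            if bit:
                min_connected = min(min_connected, k)
            else:
                max_disconnected = max(max_disconnected, k)
        rounds += 1
    return sorted_edges[min_connected], rounds
```

Treating an edge as present when its sorted index is at most `limit` avoids physically deleting edges from the adjacency lists between rounds; a real implementation would also replace the list comprehension over ranks with one connectivity check per cluster node.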
The size of the array equals the number of cluster nodes. The array is expected to contain zeros followed by ones, or all zeros, or all ones. This reflects the fact that if a cluster node $p_i$ finds the graph connected for the set of edges $K_i$, then processor $p_{i+1}$ will also find the graph connected for the set $K_{i+1}$, since $K_i \subset K_{i+1}$; conversely, if $p_{i+1}$ finds the graph disconnected, then $p_i$ will find it disconnected as well. After array L is filled and shared among all processors using Allgather, it is analyzed by each of the cluster nodes. In the analysis the algorithm updates its knowledge of the index of the edge with the minimum weight that is required to keep the graph connected, and of the index of the edge with the maximum weight that disconnects the graph. Maintaining these two indices helps in finding the bottleneck edge: it is found once the difference between the two indices is one, i.e. if the max edge index disconnected is i and the min edge index connected is i+1, then the edge at the latter index is the edge with the minimum bottleneck weight.

Each cluster node deletes and adds edges in each round, and since the graph is represented as a compressed adjacency list, deleting an edge would cost $O(\deg(v))$, which would cause a huge overhead. Instead, deletion is done using an array of edge statuses rather than by removing edges from the compressed adjacency list. Since each processor either adds or deletes edges in a given round, but never both in the same round, it can easily be seen that the number of deletion and addition operations is in $O(m/p^2)$.

Figure 3: Three rounds of the algorithm.

Figure 4: Array L, whose size is the number of cluster nodes.

Let p be the number of cluster nodes, S the size of a region, R the number of regions and $M_x$ the current search space. Then $R = p + 1$, $S = M_x/R$, and $M_x = m/p^i$, where m is the number of edges and i is the round number.
The maximum number of operations (either deletions or additions) in any round is $(p - 1) \cdot S$. The total number of moves is as follows:

$$\sum_{i=0}^{\log m} \frac{mp}{(p+1)\,p^{i}} = \frac{m\,p^{1-\log m}\left(p^{\log m + 1} - 1\right)}{(p-1)(p+1)}$$

And since the logarithm is of base p, this is $\frac{m}{(p-1)(p+1)}$, which is $O(m/p^2)$.

Figure 5: Graph representation along with the sorted edges and the edge status array.

3.3.2 Parallelization on Cluster Node Level

The parallelization on the cluster node level is mainly applied to the connectivity check, which uses BFS. The parallelization here targets multicore nodes. There are several parallel BFS versions and implementations; the one considered here is Parallel BFS (PBFS) by C. Leiserson and T. Schardl [3]. According to that paper, the parallelization works on the levels of the BFS tree. A BFS queue holds nodes from at most two distinct levels at any time; knowing that, we can replace the queue with a data structure called a bag. Essentially two bags are required: the In-Bag holds the nodes of level i and the Out-Bag holds the nodes of level i+1 (see figure 6). The nodes in the In-Bag are processed in parallel and their output goes into the Out-Bag. To do this in parallel and efficiently, the In-Bag is split into smaller In-Bags, each with its own Out-Bag (see figure 7); the Out-Bags are later merged to form the In-Bag of the next round, and the algorithm repeats. This results in a benign race when two processors process the same neighbour of two distinct nodes: the same node is added to two different Out-Bags, which does not affect the correctness of the algorithm but causes some extra work. The race can be eliminated with locking; however, the results in [3] show that with locking the performance actually got worse.
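
The In-Bag/Out-Bag scheme can be sketched as a level-synchronous BFS used as a connectivity check. The following is a simplified illustration, not the PBFS implementation of [3]: plain Python lists stand in for the bag data structure, a `ThreadPoolExecutor` stands in for Cilk workers (so no real speedup should be expected in CPython), and the function name `level_bfs_connected` and the `workers` parameter are assumptions. The graph is assumed to have at least one vertex and to be given as adjacency lists.

```python
from concurrent.futures import ThreadPoolExecutor

def level_bfs_connected(adj, workers=4):
    """Return True if the undirected graph given by adjacency lists is connected."""
    n = len(adj)
    level = [-1] * n          # -1 marks a vertex that has not been visited
    level[0] = 0
    in_bag = [0]              # vertices of the current level
    depth = 0

    def expand(chunk):
        # Process one small In-Bag and fill its private Out-Bag.
        out_bag = []
        for u in chunk:
            for v in adj[u]:
                if level[v] == -1:        # benign race: two workers may both pass
                    level[v] = depth + 1  # but they write the same value
                    out_bag.append(v)     # so v may appear in two Out-Bags
        return out_bag

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while in_bag:
            # Split the In-Bag into at most `workers` smaller In-Bags.
            size = -(-len(in_bag) // workers)  # ceiling division
            chunks = [in_bag[i:i + size] for i in range(0, len(in_bag), size)]
            out_bags = list(pool.map(expand, chunks))
            # Merge the Out-Bags into the In-Bag of the next level.
            in_bag = [v for bag in out_bags for v in bag]
            depth += 1
    return all(d != -1 for d in level)
```

A vertex that ends up in two Out-Bags is simply expanded twice in the next level, which costs extra work but leaves the result unchanged, so no lock is taken around the `level` array.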

Figure 6: Two bags, each for a distinct layer.

Figure 7: Processing the nodes in the In-Bags and putting the output in the Out-Bags.

3.4 Evaluation and Results

For evaluation and testing purposes I implemented a program, under the filename CAG.cc in the attached code folder, which creates a simple connected graph of 1000 nodes and 800,000 edges. The complexity of Algorithm 1 would be $O(p(m + n)(\log_p(m)\,m))$, while that of the sequential algorithm would be $O((m + n)\log_2 m)$. The evaluation and tests were done as follows. Four implementations were to be tested; however, the cluster at the lab has no Cilk installed, so only three implementations were tested:

1. Sequential MBST based on the binary search approach.
2. Parallel MBST with sequential BFS, i.e. parallelization on the cluster level only.
3. One cluster node with parallel BFS, i.e. parallelization on the cluster node level only, essentially one node.

The implementation that was not tested is the parallel MBST with parallel BFS, in which the parallelization is done on both the cluster and cluster node levels. The code is implemented, however, and only testing remains. The figures below show how each of the tested implementations performed.

Figure 8: Comparison of the time taken by each implementation. The yellow bar shows the parallel MBST-BFS with 5 nodes, the green bar the sequential MBST, and the blue bar the MBST-PBFS with one node.

Figure 9: Comparison of the time taken by parallel MBST-BFS with 1, 3, 4 and 5 nodes.

Figure 10: Comparison of the number of rounds for parallel MBST-BFS with 1, 3, 4 and 5 nodes.

4 Conclusion

The parallelization on the cluster level gave only a small improvement in speedup, which is to be expected since the improvement is from log base 2 to log base p, where p is the number of nodes in the cluster. The parallelization on the cluster node level gave a more significant speedup, and combining both should achieve a better speedup still; this remains to be confirmed with more testing on different numbers of nodes and edges. The parallel BFS performed significantly better because the most costly part of the algorithm is the BFS, so parallelizing it had a big impact on performance. The parallelization on the cluster level has an overhead, which I worked hard to minimize; however, parallelization on the cluster level alone gave little improvement, and it is in combination with parallelization on the cluster node level that better performance is achieved.

References

[1] David A. Bader and Guojing Cong. Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs. Journal of Parallel and Distributed Computing, 66(11):1366–1378, November 2006.
[2] P. M. Camerini. The min-max spanning tree problem and some extensions. Information Processing Letters, pages 10–14, January 1978.
[3] Charles E. Leiserson and Tao B. Schardl. A work-efficient parallel breadth-first search algorithm. SPAA '10: Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 303–314, June 2010.
[4] Ka Wong Chong, Yijie Han, and Tak Wah Lam.
Concurrent threads and optimal parallel minimum spanning trees algorithm. Journal of the ACM, 48(2):297–323, March 2001.
[5] Harold N. Gabow and Robert E. Tarjan. Algorithms for two bottleneck optimization problems. Journal of Algorithms, 9(3):411–417, September 1988.
[6] F. Dehne and S. Gotz. Practical parallel algorithms for minimum spanning trees. IEEE, 1998.
[7] Seunghwa Kang and David A. Bader. An efficient transactional memory algorithm for computing minimum spanning forest of sparse graphs. Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2009.
[8] A. Katsigiannis, N. Anastopoulos, K. Nikas, and N. Koziris. An approach to parallelize Kruskal's algorithm using helper threads. Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), 2012 IEEE 26th International, May 2012.
[9] Artur Mariano, Cristiano da Silva Sousa, and Alberto Proença. A generic and highly efficient parallel variant of Boruvka's algorithm.
[10] Sadegh Nobari, Thanh-Tung Cao, Panagiotis Karras, and Stéphane Bressan. Scalable parallel minimum spanning forest computation. In PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, February 2012.
[11] Antonio Paolacci. A MapReduce algorithm: How-to approach to the bigdata.
[12] V. Loncar et al. Distributed memory parallel algorithms for minimum spanning trees. In Proceedings of the World Congress on Engineering, volume 2, 2013.
[13] Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. Fast minimum spanning tree for large graphs on the GPU. HPG '09: Proceedings of the Conference on High Performance Graphics 2009, August 2009.
[14] Wei Wang, Shaozhong Guo, Fan Yang, and Jianxun Chen. GPU-based fast minimum spanning tree using data parallel primitives.
In The 2nd International Conference on Information Engineering and Computer Science. IEEE, December 2010.
[15] Wikipedia. https://en.wikipedia.org/wiki/Minimum_spanning_tree. Accessed: 2015-02-12.