Graph Algorithms and Distributed memory Parallel Systems
Processing graph algorithms in a distributed memory parallel systems is tricky and requires lots of optimizations to reach an optimal processing time. Graph algorithms usually requires information propagating between the graph vertices in a series of steps or iterations, such as PageRank algorithm, and thus the more graph vertices and edges we have the more messages propagates through the network. In a vertex-centric bulk synchronous parallel graph processor (BSP), such as Pregel, the complexity difference between sending messages to other graph nodes and the actual processing time for a graph node can be too huge to the extend that the main bottleneck for such processing scheme becomes the communication between different machines. This limitation was discussed by Andrew Lumsdaine et al. in “Challenges in Parallel Graph Processing“. Maybe shared memory systems are more suitable but this is out of scope for our current discussion. As a result, any optimization that reduces the network messages between the distributed machines reduces the computation complexity significantly.
Static Graph Partitioning
Using the most common graph partitioning algorithms to generate 3-way subgraphs on the sample graph in Figure 1, range and hash based partitioning generates 10 and 8 edge cuts respectively as shown in Figure 2 and Figure 3. In other words, the network cost per graph iteration, assuming a PageRank algorithm, for range based partitioning is 10 messages while it is 8 messages for the hash based partitioning. One way to reduce the physical network overhead is to reduce the “external” graph edges between two subgraphs that resides in different machines. This is done by finding the minimum cuts for the input graph which results in localizing highly connected graph vertices in a single subgraph if possible. Figure 4 shows the result of applying the minimum cuts algorithm on the graph in Figure 1, which results in 5 edge cuts only. It is worth to mention that finding the minimum cuts of the graph is a costly operation and sometimes cannot be done for large graphs with limited hardware. So smartly partitioning the graph should be able to reduce the cost of end-to-end computations significantly and cover for the partitioning costs, or otherwise it wouldn’t make sense to use it. We used METIS to generate the 3-way partitioning on the sample graph, but you can use any other tools or algorithms on the web for finding the minimal cuts in a graph. Please refer to “Algorithms and Software for Partitioning Graphs” for more details. In our technical report “Mizan: Optimizing Graph Mining in Large Parallel Systems” we discuss in details the costs of partitioning with respect to the graph structure.
Static and Dynamic Graph Algorithms
Static graph algorithms are usually the algorithms represented mathematically by a matrix vector multiplication such as PageRank, Random Walks and Diameter Estimation. Such algorithms always have a fixed behavior across the multiple iterations of the algorithm without any surprises to the graph processor. For example, each vertex at each iteration of PageRank algorithm gets ranks from the vertex’s incoming edges and sends the newly calculated rank to all of its outgoing edges. The complexity of static algorithms is directly related to the messages propagation through the physical network which means that minimizing the graph cuts using static partitioning improves the performance of the algorithm. So in other words, the static algorithms benefit from smartly partitioning the graph and localizing the edges of the graph. However, this is not the case for dynamic graph algorithms.
Dynamic graph algorithms has a variable behavior across the algorithm iterations, where the amount of incoming messages, the outgoing messages or even the massage size differs depending on the state of the vertex. Example algorithms are: Distributed Minimal Spanning tree, Advertisement Propagation Simulation on Graphs and Finding the Maximal vertex value. Such algorithms leads to the problem that the workload of each vertex does not depends directly on the graph edges, or any other static feature of the graph, which is only determined during the algorithm runtime. Moreover, the dynamic behavior of a graph algorithm leads to an unbalanced graph processing system which causes performance degradations to the end-to-end processing, for example some workers can be overloaded with incoming/outgoing messages while others are idle. In this case starting the computation with a static smart partitioning does not necessary leads to a balanced or optimized computation; the system should be able to adapt for the behavioral change of the graph algorithm to avoid system imbalance and improve the response time of the system.