Large-Scale Graph Data Analysis Algorithms
CS 6240 Large-Scale Parallel Data Processing

I worked on this mini project as part of the Large-Scale Parallel Data Processing course. The aim was to develop a parallalel algorithms in Spark from scratch for efficient sampling from large graphs.

Algorithm I: Clever Graph Sampling

The goal is to sample nodes and edges from an input graph to generate a smaller graph that retains the most connected components in the original graph. A connected component is a subgraph in which each pair of nodes is connected to each other via a path.

This sampling method is based on random walks. We first select a small set of nodes as starting nodes and simulate random walks on the graph starting from those nodes. In every iteration, a small number of nodes that were already sampled are selected and added as new roots for additional random walks.

Algorithm and Program Analysis

There are two main parameters that can affect the program performance: root fraction and number of partitions.
The root fraction is the percentage of graph nodes that are utilized as roots for the random walks. Below are the results of varying this parameter while keeping all others constant for obtaining a sample that has 40% of the edges of the original graph.


Root Fraction Running Time (sec) Iterations Connected Components in Sample
15% 396 48 17,866
10% 586 73 11,660
5% 1,102 146 5,942

The program runs faster when the root fraction is larger, but the number of components found in the sample is larger and could potentially contains more broken components.

The chart below show the results of varying the number of partitions.


The program ran fastest when the number of partitions was set to 30. Increasing the number of partitions allows more parallelism but the communication cost when connecting edges across different partitions after each iterations slows down the running time.

This graph sampling algorithm based on random walks is useful for generating samples that preserve graph connectedness. By adjusting the root fraction parameter, we can tradeoff between speed and level of graph connectedness in the output sample.




Design borrowed from John Barron's website.