Large-Scale Graph Analysis

Algorithm I: Clever Graph Sampling

The goal is to sample nodes and edges from an input graph to generate a smaller graph that retains the most connected components in the original graph. A connected component is a subgraph in which each pair of nodes is connected to each other via a path.

This sampling method is based on random walks. We first select a small set of nodes as starting nodes and simulate random walks on the graph starting from those nodes. In every iteration, a small number of nodes that were already sampled are selected and added as new roots for additional random walks.

Algorithm and Program Analysis

There are two main parameters that can affect the program performance: root fraction and number of partitions.
The root fraction is the percentage of graph nodes that are utilized as roots for the random walks. Below are the results of varying this parameter while keeping all others constant for obtaining a sample that has 40% of the edges of the original graph.

Root Fraction	Running Time (sec)	Iterations	Connected Components in Sample
15%	396	48	17,866
10%	586	73	11,660
5%	1,102	146	5,942

The program runs faster when the root fraction is larger, but the number of components found in the sample is larger and could potentially contains more broken components.

The chart below show the results of varying the number of partitions.

The program ran fastest when the number of partitions was set to 30. Increasing the number of partitions allows more parallelism but the communication cost when connecting edges across different partitions after each iterations slows down the running time.

This graph sampling algorithm based on random walks is useful for generating samples that preserve graph connectedness. By adjusting the root fraction parameter, we can tradeoff between speed and level of graph connectedness in the output sample.

Design borrowed from John Barron's website.

Large-Scale Graph Data Analysis Algorithms
CS 6240 Large-Scale Parallel Data Processing

Code

Algorithm I: Clever Graph Sampling

Algorithm and Program Analysis

Large-Scale Graph Data Analysis Algorithms CS 6240 Large-Scale Parallel Data Processing

Code

Algorithm I: Clever Graph Sampling

Algorithm and Program Analysis

Large-Scale Graph Data Analysis Algorithms
CS 6240 Large-Scale Parallel Data Processing