NetworkX average shortest path length and diameter are taking forever

I have a graph (A) built from unweighted edges, and I would like to compute the average shortest path length for its biggest connected component (giantC). However, the script has been running for more than 3 hours so far (tried on Colab and locally), and no result has been produced for either the diameter or average_shortest_path_length.

I am using networkx==2.5, python==3.6.9

Here is my script:

import json
import logging

import networkx as nx
from networkx.algorithms.distance_measures import diameter
from networkx.algorithms.shortest_paths.generic import average_shortest_path_length


# the graph is built from a json file as follows
with open('graph.json') as f:
    graph_dict = json.load(f)

_indices = graph_dict['indices']
s_lst, rs_lst = _indices[0], _indices[1]

graph_ = nx.Graph()
for i in range(len(s_lst)):
    graph_.add_edge(s_lst[i], rs_lst[i])


# fetch the hugest graph of all graphs
connected_subgraphs = [graph_.subgraph(cc) for cc in nx.connected_components(graph_)]
logging.info('connected subgraphs fetched.')
Gcc = max(nx.connected_components(graph_), key=len)
giantC = graph_.subgraph(Gcc)
logging.info('Fetched Giant Subgraph')

n_nodes = giantC.number_of_nodes()
print(f'Number of nodes: {n_nodes}') # output is 106088

avg_shortest_path = average_shortest_path_length(giantC)
print(f'Avg Shortest path len: {avg_shortest_path}')

dia = diameter(giantC)
print(f'Diameter: {dia}')

Is there any way to make this faster, or an alternative way to compute both the diameter and the average shortest path length for the giantC graph?



Solution 1:[1]

For future readers: if you want to fetch the largest connected subgraph from your NetworkX graph:

import networkx as nx
import logging


def fetch_hugest_subgraph(graph_):
    Gcc = max(nx.connected_components(graph_), key=len)
    giantC = graph_.subgraph(Gcc)
    logging.info('Fetched Giant Subgraph')
    return giantC
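A quick sanity check of the function above on a toy graph (the edges here are arbitrary, chosen only to produce two components of different sizes):

```python
import networkx as nx


def fetch_hugest_subgraph(graph_):
    # the largest connected component by node count
    Gcc = max(nx.connected_components(graph_), key=len)
    return graph_.subgraph(Gcc)


g = nx.Graph()
g.add_edges_from([(1, 2), (2, 3), (4, 5)])  # components {1, 2, 3} and {4, 5}
giant = fetch_hugest_subgraph(g)
print(sorted(giant.nodes()))  # [1, 2, 3]
```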

If you want to estimate the average shortest path length for your graph, we can do that by sampling node pairs:

from statistics import mean
import networkx as nx
import random


def write_nodes_number_and_shortest_paths(graph_, n_samples=10_000,
                                          output_path='graph_info_output.txt'):
    with open(output_path, encoding='utf-8', mode='w') as f:
        for component in nx.connected_components(graph_):
            component_ = graph_.subgraph(component)
            nodes = list(component_.nodes())
            lengths = []
            for _ in range(n_samples):
                # sample two *distinct* nodes; random.choices samples with
                # replacement, so n1 could equal n2 and bias the estimate
                # with zero-length paths
                n1, n2 = random.sample(nodes, 2)
                lengths.append(nx.shortest_path_length(component_, source=n1, target=n2))
            f.write(f'Nodes num: {len(nodes)}, shortest path mean: {mean(lengths)}\n')

Computing average_shortest_path_length, as I was informed by Joris Kinable (in the comments), has a complexity of O(V^3), where V is the number of nodes. The same applies to computing the diameter of your graph.
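If pair sampling is still too slow, another option (a sketch, not part of the original answer) is to sample *source nodes* instead of pairs: each `nx.single_source_shortest_path_length` call is a single BFS, costing O(V + E), and yields distances to every reachable node at once. The function name and parameters below are illustrative:

```python
import random
from statistics import mean

import networkx as nx


def estimate_avg_shortest_path(component_, n_sources=50, seed=42):
    """Estimate the average shortest path length of a connected graph
    by averaging BFS distances from a random sample of source nodes."""
    rng = random.Random(seed)
    nodes = list(component_.nodes())
    sources = rng.sample(nodes, min(n_sources, len(nodes)))
    lengths = []
    for src in sources:
        # one BFS gives distances from src to every other node
        dist = nx.single_source_shortest_path_length(component_, src)
        lengths.extend(d for node, d in dist.items() if node != src)
    return mean(lengths)


# on a 5-node path graph the exact average shortest path length is 2.0,
# and sampling all 5 sources recovers it exactly
print(estimate_avg_shortest_path(nx.path_graph(5), n_sources=5))
```

With `n_sources` much smaller than V this trades accuracy for a roughly (n_sources / V) fraction of the full all-pairs cost.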

Solution 2:[2]

For future readers: in NetworkX >= 2.6, an approximated diameter metric is available for both directed and undirected graphs.

Example:

>>> import timeit
>>> timeit.timeit("print(nx.diameter(g))",setup="import networkx as nx; g = nx.fast_gnp_random_graph(5000, 0.03, 100)", number=1)
3
224.9891120430002
>>> timeit.timeit("print(nx.approximation.diameter(g))",setup="import networkx as nx; g = nx.fast_gnp_random_graph(5000, 0.03, 100)", number=1)
3
0.09284040399961668

Note that the approximated metric computes a lower bound with respect to the exact value.
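A small self-contained check of that lower-bound property (graph size and edge probability chosen arbitrarily so both calls finish instantly):

```python
import networkx as nx

g = nx.fast_gnp_random_graph(200, 0.05, seed=1)
# restrict to the largest connected component so the diameter is defined
giant = g.subgraph(max(nx.connected_components(g), key=len))

exact = nx.diameter(giant)
approx = nx.approximation.diameter(giant)  # NetworkX >= 2.6
print(exact, approx)
assert approx <= exact  # the approximation never exceeds the true diameter
```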

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
