dfs/bfs seem to have overhead depending on total graph size
Hi all,
I’ve got a very large undirected graph (407 mil vertices, 522 mil edges, 2 vertex properties: int64_t and vector<int64_t>) that consists of multiple connected components (ccs).
I noticed that when I call e.g. dfs_iterator or dfs_search on a source vertex, it takes around 1 – 2 seconds to return. The upper bound is depending on the component’s size, but the lower bound seems to be the same for all components.
I have created a test graph with only a subset of the ccs of the large graph. Iterating the same cc in the test graph takes only a couple of milliseconds instead of ≥ 1 s. This tells me, that the dfs/bfs iterators have some kind of overhead depending on the complete graph size.
I wrote a DFSVisitor that collects some timings during iteration to better see which steps consume time. Here are the results (rounded for readability):
In [30]: test_dfs(graph, graph.vertex(0))
# first time the function is entered
visitor.start_vertex_t 2.3e-05 s
visitor.first_discover_vertex_t 0.18 s
visitor.first_examine_edge_t 0.18 s
visitor.first_tree_edge_t 0.63 s
visitor.first_finish_vertex_t 0.63 s
# average time between last 2 calls of the function
visitor.discover_vertex_t 0.002 s
visitor.examine_edge_t 0.001 s
visitor.tree_edge_t 0.001 s
visitor.finish_vertex_t 0.001 s
# last time finished() is called
visitor.finished 1.3 s
# number of times the functions were called
visitor.discovered_vertices 565
visitor.examined_edges 1978
visitor.tree_edges 564
visitor.finished_vertices 565
took 1.38 s
As you can see, start_vertex is called immediately, but then it takes a very long time until the other Visitor functions are called for the first time after which the calls are faster again, but still quite slow. On the test graph I think I can see the same trend with smaller numbers because the graph is smaller:
In [30]: test_dfs(test_graph, test_graph.vertex(0))
# first time the function is entered
visitor.start_vertex_t 2e-05 s
visitor.first_discover_vertex_t 0.0007 s
visitor.first_examine_edge_t 0.0007 s
visitor.first_tree_edge_t 0.0016 s
visitor.first_finish_vertex_t 0.0015 s
# average time between last 2 calls of the function
visitor.discover_vertex_t 1.6e-05 s
visitor.examine_edge_t 3.7e-06 s
visitor.tree_edge_t 1.4e-05 s
visitor.finish_vertex_t 1.4e-05 s
# last time finished() is called
visitor.finished 0.01 s
# number of times the functions were called
visitor.discovered_vertices 565
visitor.examined_edges 1978
visitor.tree_edges 564
visitor.finished_vertices 565
took 0.01 s
I also noticed that when I remove all vertex properties from the large graph, the minimum runtime of dfs_search is reduced by 0.7 s.
As it is, iterating small ccs in the large graph using pure python is much much faster than using graph-tool. It takes only a couple of milliseconds:
def dfs(graph, vertex_idx):
t = time.time()
todo = {vertex_idx}
visited = set()
while len(todo) > 0:
next_vertex = todo.pop()
visited.add(next_vertex)
todo |= set(graph.iter_all_neighbors(graph.vertex(next_vertex))) - visited
print(f'dfs took {time.time() - t} s')
return visited
In [38]: d = dfs(graph, 0)
dfs took 0.01653146743774414 s
Because of graph-tool’s slowness for small ccs, my current workaround was to write a heuristic that always attempts a naive python dfs first, but aborts if 1 second is exceeded and only then does the graph_tool dfs.
Anyone know what’s going on here?
Best
My-Tien