Is there anything I can help with?
When I calculate the all pairs shortest path distances of a graph using shortest_distance()
and histogram them using np.histogram
, there is a hefty mismatch to distance_histogram()
's output just for larger graphs.
graph-tool
version?graph-tool
manually?Please replace this section with a brief summary of your issue.
gt.__version__ == 2.43 (commit 5778eb10, )
uname -a
Linux hostname 5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
import graph_tool.all as gt
import numpy as np
G = gt.random_graph(300, lambda: np.random.randint(1,10,2))
G_weight = G.new_edge_property('float')
for e in G.edges():
G_weight[e] = 100 * np.random.random()
apsp_true = [ d for l in gt.shortest_distance(G, weights=G_weight) for d in l if 0 < d <= 1e50 ]
counts, bins = gt.distance_histogram(G, G_weight)
counts_np, _ = np.histogram(apsp_true, bins=bins)
print(np.where(counts!=counts_np))
Up to 300 nodes, this returns (array([], dtype=int64),)
as expected. From 301 onward there is a great mismatch:
(array([ 2, 3, 4, 6, 9, 12, 13, 14, 16, 17, 20, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 33, 35, 36, 37, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 56,
57, 61, 62, 65, 66, 67, 68, 69, 70, 71, 72, 73, 75,
76, 77, 78, 79, 80, 81, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 94, 95, 96, 97, 98, 100, 101, 102, 103, 104,
105, 106, 107, 108, 110, 111, 112, 113, 114, 115, 116, 117, 118,
119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
145, 146, 147, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158,
159, 161, 162, 163, 166, 167, 168, 169, 170, 171, 172, 173, 174,
175, 176, 177, 180, 181, 182, 183, 184, 185, 187, 188, 189, 190,
191, 192, 193, 194, 195, 196, 197, 199, 200, 201, 202, 203, 205,
206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218,
219, 220, 221, 222, 223, 224, 225, 226, 228, 229, 230, 231, 233,
234, 236, 237, 238, 240, 241, 242, 243, 245, 246, 247, 248, 249,
250, 253, 254, 255, 256, 257, 258, 259, 261, 262, 263, 265, 266,
267, 268, 271, 272, 273, 275, 276, 277, 279, 281, 284, 288, 305,
306, 308, 310, 313, 321, 326, 333, 334, 348, 361, 365, 424]),)
For some graphs even sum(counts) != sum(counts_np)
.
What's happening here?
I just have access to the conda-forge version in office so I now compiled the git version (2.45dev (commit 8645b67a, Tue Jan 4 11:23:59 2022 +0100)
) at home and the problem remains unchanged there as well.
graph-tool
version?graph-tool
manually?gt.__version__ == 2.43 (commit 5778eb10, )
uname -a
Linux hostname 5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
import graph_tool.all as gt
import random as rd
G = gt.random_graph(1000, lambda: np.random.randint(1,10,2))
G_weight = G.new_edge_property('float')
G_vp = G.new_vertex_property('bool')
for v in G.vertices():
G_vp[v] = rd.randint(0,1)
for e in G.edges():
G_weight[e] = rd.random()
G.vertex_properties['trigger'] = G_vp
G.edge_properties['weight'] = G_weight
GV = gt.GraphView( G, vfilt=G.vertex_properties['trigger'] )
gt.shortest_distance( GV, weights=GV.edge_properties['weight'] )
Segmentation fault (core dumped)
Might be related to #722
Worse, I just noticed that the output of distance_histogram()
is not constant:
G = gt.random_graph(1000, lambda: np.random.randint(1,10,2))
G_weight = G.new_edge_property('float')
for e in G.edges():
G_weight[e] = 100 * np.random.random()
apsp_true = [ d for l in gt.shortest_distance(G, weights=G_weight) for d in l if 0 < d <= 1e50 ]
apsp_true_mean = sum(apsp_true) / len(apsp_true)
X, Y = range(1000), []
for _ in X:
counts, bins = gt.distance_histogram(G, weight=G_weight)
Y.append(sum(counts * (.5 + bins[:-1])) / sum(counts))
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(X,Y)
ax.hlines(apsp_true_mean, min(X), max(X), ls='--', label='True Mean')
fig.tight_layout()
ax.legend()
ax.set_xlabel('Repetitions')
ax.set_ylabel('Mean of APSP')
fig.savefig('non-constant-apsp.png', bbox_inches='tight')
graph-tool
version?graph-tool
manually?gt.__version__ == 2.43 (commit 5778eb10, )
uname -a
Linux hostname 5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
import graph_tool.all as gt
gt.shortest_path(gt.Graph(), 1, 2)
Segmentation fault (core dumped)
Might be related to #712
When I calculate the all pairs shortest path distances of a graph using shortest_distance()
and histogram them using np.histogram
, there is a hefty mismatch to distance_histogram()
's output just for larger graphs.
graph-tool
version?graph-tool
manually?Please replace this section with a brief summary of your issue.
gt.__version__ == 2.43 (commit 5778eb10, )
uname -a
Linux hostname 5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
import graph_tool.all as gt
import numpy as np
G = gt.random_graph(300, lambda: np.random.randint(1,10,2))
G_weight = G.new_edge_property('float')
for e in G.edges():
G_weight[e] = 100 * np.random.random()
apsp_true = [ d for l in gt.shortest_distance(G, weights=G_weight) for d in l if 0 < d <= 1e50 ]
counts, bins = gt.distance_histogram(G, G_weight)
counts_np, _ = np.histogram(apsp_true, bins=bins)
print(np.where(counts!=counts_np))
Up to 300 nodes, this returns (array([], dtype=int64),)
as expected. From 301 onward there is a great mismatch:
(array([ 2, 3, 4, 6, 9, 12, 13, 14, 16, 17, 20, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 33, 35, 36, 37, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 56,
57, 61, 62, 65, 66, 67, 68, 69, 70, 71, 72, 73, 75,
76, 77, 78, 79, 80, 81, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 94, 95, 96, 97, 98, 100, 101, 102, 103, 104,
105, 106, 107, 108, 110, 111, 112, 113, 114, 115, 116, 117, 118,
119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
145, 146, 147, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158,
159, 161, 162, 163, 166, 167, 168, 169, 170, 171, 172, 173, 174,
175, 176, 177, 180, 181, 182, 183, 184, 185, 187, 188, 189, 190,
191, 192, 193, 194, 195, 196, 197, 199, 200, 201, 202, 203, 205,
206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218,
219, 220, 221, 222, 223, 224, 225, 226, 228, 229, 230, 231, 233,
234, 236, 237, 238, 240, 241, 242, 243, 245, 246, 247, 248, 249,
250, 253, 254, 255, 256, 257, 258, 259, 261, 262, 263, 265, 266,
267, 268, 271, 272, 273, 275, 276, 277, 279, 281, 284, 288, 305,
306, 308, 310, 313, 321, 326, 333, 334, 348, 361, 365, 424]),)
For some graphs even sum(counts) != sum(counts_np)
.
What's happening here?
For anyone else reading this:
import graph_tool.all as gt
import numpy as np
import math
Let this be the baseline:
APSP = lambda G, weights: sum((l:=[ d for l in gt.shortest_distance(G, weights=weights) for d in l if 0 < d <= 1e50 ]))/len(l)
APSP(G, G_weight)
582.4092142257183
%timeit APSP(G, G_weight)
5.3 s ± 130 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def subsample_APSP(G, weights=G_weight):
def sample_mean(samples, G=G, weights=weights):
counts, bins = gt.distance_histogram(G, weight=weights, samples=samples)
return sum(counts * (.5 + bins[:-1])) / sum(counts)
N_samples = int( round( G.num_vertices() / 2, 0) )
N = int( round( math.sqrt(N_samples), 0 ) )
M = int( round( N_samples / N, 0 ) )
out = [ sample_mean(M) for _ in range(N) ]
return sum(out) / len(out), np.std(out) / math.sqrt(N)
print("{:.2f} +- {:.2f}".format(*subsample_APSP(G)))
582.55 +- 2.83
%timeit subsample_APSP(G)
205 ms ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can get the average from the histogram quite easily. BTW that function implements exactly the algorithm I described above, with the bonus that it can run in parallel. So indeed there is nothing else that needs to be implemented.
I now found the time to investigate distance_histogram
and indeed, it works like a charm! Thanks a ton for the hint, I wouldn't have found it.
With half the vertices I estimate the real mean of a full shortest_distance
with less than 0.5 % deviation in less than 5 % the time!
This will not be implemented, since it can be trivially done by the user.
Well, it might be trivially for you as the creator of the package but I don't see a trivial way to do so without writing a big part of Dijkstra or A* myself - otherwise I wouldn't have created this ticket.
I do not know of what "bookeeping" you are referring to that can be avoided.
I mean the heap that is certainly being used when calculating all pairs shortest path. As well as the loop-up what path segments had already been calculated prior.
Well, certainly I can sample random node pairs from the graph and calculate shortest_distance
on each of them but that causes this aforementioned incredible run-time overhead (less than 3 % of the nodes result in a run-time of the full all pairs dijkstra).
If you are only interested in the statistic of distances, you can obtain this with
distance_histogram()
which optionally implements subsampling.
I need the average distance but I'll have a look into how well I can approx. it using distance_histogram()
! Thank you for the hint!
graph_tool.topology.shortest_distance(G, U, V)
, where U
and V
are lists of
same length with sources / targets.
E.g. U = [ u_1, ..., u_n ]
, V = [ v_1, ..., v_n ]
for the tuples (u_1, v_1)
, ..., (v_n, u_n)
In case graph_tool.topology.shortest_distance
takes too long for an all pairs shortest path calculation, the result can be approximated using random sub-samples of (source, target) node pairs.
Manual sub-sampling has no bookkeeping of already calculated paths and hence results in a huge computational overhead.