Edge prediction as binary classification ++++++++++++++++++++++++++++++++++++++++ A more traditional approach to the prediction of missing and spurious edges formulates it as a supervised binary classification task __, where the edge/non-edge scores are computed by fitting a generative model to the observed data, and computing their probabilities under that model [clauset-hierarchical-2008]_ [guimera-missing-2009]_. In this setting, one typically omits any explicit model of the measurement process (hence intrinsically assuming it to be uniform), and as a consequence of the overall setup, only *relative probabilities* between individual missing and spurious edges can be produced, instead of the full posterior distribution considered in the last section. Since this limits the overall network reconstruction, and does not yield confidence intervals, it is a less powerful approach. Nevertheless, it is a popular procedure, which can also be performed with graph-tool, as we describe in the following. We set up the classification task by dividing the edges/non-edges into two sets :math:\boldsymbol A and :math:\delta \boldsymbol A, where the former corresponds to the observed network and the latter either to the missing or spurious edges. We may compute the posterior of :math:\delta \boldsymbol A as [valles-catala-consistencies-2018]_ .. math:: :label: posterior-missing P(\delta \boldsymbol A | \boldsymbol A) \propto \sum_{\boldsymbol b}\frac{P(\boldsymbol A \cup \delta\boldsymbol A| \boldsymbol b)}{P(\boldsymbol A| \boldsymbol b)}P(\boldsymbol b | \boldsymbol A) up to a normalization constant [#prediction_posterior]_. Although the normalization constant is difficult to obtain in general (since we need to perform a sum over all possible spurious/missing edges), the numerator of Eq. :eq:posterior-missing can be computed by sampling partitions from the posterior, and then inserting or deleting edges from the graph and computing the new likelihood. This means that we can easily compare alternative predictive hypotheses :math:\{\delta \boldsymbol A_i\} via their likelihood ratios .. math:: \lambda_i = \frac{P(\delta \boldsymbol A_i | \boldsymbol A)}{\sum_j P(\delta \boldsymbol A_j | \boldsymbol A)} which do not depend on the normalization constant. The values :math:P(\delta \boldsymbol A | \boldsymbol A, \boldsymbol b) can be computed with :meth:~graph_tool.inference.blockmodel.BlockState.get_edges_prob. Hence, we can compute spurious/missing edge probabilities just as if we were collecting marginal distributions when doing model averaging. Below is an example for predicting the two following edges in the football network, using the nested model (for which we need to replace :math:\boldsymbol b by :math:\{\boldsymbol b_l\} in the equations above). .. testcode:: missing-edges :hide: import os try: os.chdir("demos/inference") except FileNotFoundError: pass g = gt.collection.data["football"].copy() color = g.new_vp("string", val="#cccccc") ecolor = g.new_ep("string", val="#cccccc") ewidth = g.new_ep("double", 1) e = g.add_edge(101, 102) ecolor[e] = "#a40000" ewidth[e] = 5 e = g.add_edge(17, 56) ecolor[e] = "#a40000" ewidth[e] = 5 eorder = g.edge_index.copy("int") gt.graph_draw(g, pos=g.vp.pos, vertex_color=color, vertex_fill_color=color, edge_color=ecolor, eorder=eorder, edge_pen_width=ewidth, output="football_missing.svg") .. figure:: football_missing.* :align: center :width: 350px Two non-existing edges in the football network (in red): :math:(101,102) in the middle, and :math:(17,56) in the upper right region of the figure. .. testsetup:: missing-edges gt.seed_rng(7) .. testcode:: missing-edges g = gt.collection.data["football"] missing_edges = [(101, 102), (17, 56)] L = 10 state = gt.minimize_nested_blockmodel_dl(g, deg_corr=True) bs = state.get_bs() # Get hierarchical partition. bs += [np.zeros(1)] * (L - len(bs)) # Augment it to L = 10 with # single-group levels. state = state.copy(bs=bs, sampling=True) probs = ([], []) def collect_edge_probs(s): p1 = s.get_edges_prob([missing_edges[0]], entropy_args=dict(partition_dl=False)) p2 = s.get_edges_prob([missing_edges[1]], entropy_args=dict(partition_dl=False)) probs[0].append(p1) probs[1].append(p2) # Now we collect the probabilities for exactly 100,000 sweeps gt.mcmc_equilibrate(state, force_niter=10000, mcmc_args=dict(niter=10), callback=collect_edge_probs) def get_avg(p): p = np.array(p) pmax = p.max() p -= pmax return pmax + log(exp(p).mean()) p1 = get_avg(probs[0]) p2 = get_avg(probs[1]) p_sum = get_avg([p1, p2]) + log(2) l1 = p1 - p_sum l2 = p2 - p_sum print("likelihood-ratio for %s: %g" % (missing_edges[0], exp(l1))) print("likelihood-ratio for %s: %g" % (missing_edges[1], exp(l2))) .. testoutput:: missing-edges likelihood-ratio for (101, 102): 0.37... likelihood-ratio for (17, 56): 0.62... From which we can conclude that edge :math:(17, 56) is more likely than :math:(101, 102) to be a missing edge. The prediction using the non-nested model can be performed in an entirely analogous fashion.