Edge prediction as binary classification
++++++++++++++++++++++++++++++++++++++++

A more traditional approach to the prediction of missing and spurious
edges formulates it as a supervised binary classification task, where
the edge/non-edge scores are computed by fitting a generative model to
the observed data, and computing their probabilities under that model
[clauset-hierarchical-2008]_ [guimera-missing-2009]_. In this setting,
one typically omits any explicit model of the measurement process
(hence intrinsically assuming it to be uniform), and as a consequence
of the overall setup, only *relative probabilities* between individual
missing and spurious edges can be produced, instead of the full
posterior distribution considered in the last section. Since this
limits the overall network reconstruction, and does not yield
confidence intervals, it is a less powerful approach. Nevertheless, it
is a popular procedure, which can also be performed with graph-tool,
as we describe in the following.

We set up the classification task by dividing the edges/non-edges into
two sets :math:`\boldsymbol A` and :math:`\delta \boldsymbol A`, where
the former corresponds to the observed network and the latter either to
the missing or spurious edges. We may compute the posterior of
:math:`\delta \boldsymbol A` as [valles-catala-consistencies-2018]_

.. math::
   :label: posterior-missing

   P(\delta \boldsymbol A | \boldsymbol A) \propto
   \sum_{\boldsymbol b}\frac{P(\boldsymbol A \cup \delta\boldsymbol A| \boldsymbol b)}{P(\boldsymbol A| \boldsymbol b)}P(\boldsymbol b | \boldsymbol A)

up to a normalization constant [#prediction_posterior]_. Although the
normalization constant is difficult to obtain in general (since we need
to perform a sum over all possible spurious/missing edges), the
numerator of Eq. :eq:`posterior-missing` can be computed by sampling
partitions from the posterior, and then inserting or deleting edges
from the graph and computing the new likelihood. This means that we can
easily compare alternative predictive hypotheses :math:`\{\delta
\boldsymbol A_i\}` via their likelihood ratios

.. math::

   \lambda_i = \frac{P(\delta \boldsymbol A_i | \boldsymbol A)}{\sum_j P(\delta \boldsymbol A_j | \boldsymbol A)}

which do not depend on the normalization constant.

The values :math:`P(\delta \boldsymbol A | \boldsymbol A, \boldsymbol b)`
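Since the individual probabilities :math:`P(\delta \boldsymbol A_i | \boldsymbol A)` are typically vanishingly small, the ratios :math:`\lambda_i` are best evaluated in log space. A minimal sketch of this computation, using made-up log-probability values in place of actual model outputs:

```python
import numpy as np

# Hypothetical log-probabilities log P(dA_i | A) for three candidate
# edge sets (the values are made up for illustration).
log_p = np.array([-15.2, -12.7, -14.1])

# lambda_i = P_i / sum_j P_j, computed stably via the log-sum-exp
# trick: subtracting the maximum before exponentiating prevents
# underflow of the individual terms.
shift = log_p.max()
log_norm = shift + np.log(np.exp(log_p - shift).sum())
lambdas = np.exp(log_p - log_norm)

print(lambdas)         # normalized likelihood ratios, summing to 1
```

The same max-shifting idea reappears in the `get_avg()` helper used in the example below.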
can be computed with
:meth:`~graph_tool.inference.blockmodel.BlockState.get_edges_prob`.
Hence, we can compute spurious/missing edge probabilities just as if we
were collecting marginal distributions when doing model averaging.

Below is an example for predicting the two following edges in the
football network, using the nested model (for which we need to replace
:math:`\boldsymbol b` by :math:`\{\boldsymbol b_l\}` in the equations
above).

.. testcode:: missing-edges
   :hide:

   import os
   try:
       os.chdir("demos/inference")
   except FileNotFoundError:
       pass
   g = gt.collection.data["football"].copy()
   color = g.new_vp("string", val="#cccccc")
   ecolor = g.new_ep("string", val="#cccccc")
   ewidth = g.new_ep("double", 1)
   e = g.add_edge(101, 102)
   ecolor[e] = "#a40000"
   ewidth[e] = 5
   e = g.add_edge(17, 56)
   ecolor[e] = "#a40000"
   ewidth[e] = 5
   eorder = g.edge_index.copy("int")
   gt.graph_draw(g, pos=g.vp.pos, vertex_color=color,
                 vertex_fill_color=color, edge_color=ecolor,
                 eorder=eorder, edge_pen_width=ewidth,
                 output="football_missing.svg")

.. figure:: football_missing.*
   :align: center
   :width: 350px

   Two non-existing edges in the football network (in red):
   :math:`(101,102)` in the middle, and :math:`(17,56)` in the upper
   right region of the figure.

.. testsetup:: missing-edges

   gt.seed_rng(7)

.. testcode:: missing-edges

   g = gt.collection.data["football"]
   missing_edges = [(101, 102), (17, 56)]

   L = 10

   state = gt.minimize_nested_blockmodel_dl(g, deg_corr=True)

   bs = state.get_bs()                     # Get hierarchical partition.
   bs += [np.zeros(1)] * (L - len(bs))     # Augment it to L = 10 with
                                           # single-group levels.
   state = state.copy(bs=bs, sampling=True)

   probs = ([], [])

   def collect_edge_probs(s):
       p1 = s.get_edges_prob([missing_edges[0]],
                             entropy_args=dict(partition_dl=False))
       p2 = s.get_edges_prob([missing_edges[1]],
                             entropy_args=dict(partition_dl=False))
       probs[0].append(p1)
       probs[1].append(p2)

   # Now we collect the probabilities for exactly 100,000 sweeps
   gt.mcmc_equilibrate(state, force_niter=10000, mcmc_args=dict(niter=10),
                       callback=collect_edge_probs)

   def get_avg(p):
       p = np.array(p)
       pmax = p.max()
       p -= pmax
       return pmax + log(exp(p).mean())

   p1 = get_avg(probs[0])
   p2 = get_avg(probs[1])

   p_sum = get_avg([p1, p2]) + log(2)

   l1 = p1 - p_sum
   l2 = p2 - p_sum

   print("likelihood-ratio for %s: %g" % (missing_edges[0], exp(l1)))
   print("likelihood-ratio for %s: %g" % (missing_edges[1], exp(l2)))

.. testoutput:: missing-edges

   likelihood-ratio for (101, 102): 0.37...
   likelihood-ratio for (17, 56): 0.62...

From which we can conclude that edge :math:`(17, 56)` is more likely
than :math:`(101, 102)` to be a missing edge.

The prediction using the non-nested model can be performed in an
entirely analogous fashion.
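A brief note on the `get_avg()` helper used above: it averages probabilities in log space by first shifting all values by their maximum, because exponentiating the raw log-probabilities would underflow to zero in double precision. A self-contained sketch of the difference, using made-up log values of a realistic magnitude:

```python
import numpy as np

# Made-up log-probabilities of a magnitude where exp() underflows.
log_p = np.array([-800.0, -801.0, -799.5])

# Naive averaging fails: exp(-800) underflows to 0, so the log is -inf.
with np.errstate(divide="ignore"):
    naive = np.log(np.exp(log_p).mean())

# Shifting by the maximum first, as in get_avg(), keeps the
# intermediate values in a representable range.
pmax = log_p.max()
stable = pmax + np.log(np.exp(log_p - pmax).mean())

print(naive)    # -inf
print(stable)   # a finite value close to log_p.max()
```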