_prediction.rst 5.37 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14
Edge prediction as binary classification
++++++++++++++++++++++++++++++++++++++++

A more traditional approach to the prediction of missing and spurious
edges formulates it as a supervised `binary classification task
<https://en.wikipedia.org/wiki/Binary_classification>`__, where the
edge/non-edge scores are computed by fitting a generative model to the
observed data, and computing their probabilities under that model
[clauset-hierarchical-2008]_ [guimera-missing-2009]_. In this setting,
one typically omits any explicit model of the measurement process (hence
intrinsically assuming it to be uniform), and as a consequence of the
overall setup, only *relative probabilities* between individual missing
and spurious edges can be produced, instead of the full posterior
distribution considered in the last section. Since this limits the
15
overall network reconstruction, and does not yield confidence
16 17 18 19 20
intervals, it is a less powerful approach. Nevertheless, it is a popular
procedure, which can also be performed with graph-tool, as we describe
in the following.

We set up the classification task by dividing the edges/non-edges into
21
two sets :math:`\boldsymbol A` and :math:`\delta \boldsymbol A`, where
22 23
the former corresponds to the observed network and the latter either to
the missing or spurious edges. We may compute the posterior of
24
:math:`\delta \boldsymbol A` as [valles-catala-consistencies-2018]_
25 26 27 28

.. math::
   :label: posterior-missing

29 30
   P(\delta \boldsymbol A | \boldsymbol A) \propto
   \sum_{\boldsymbol b}\frac{P(\boldsymbol A \cup \delta\boldsymbol A| \boldsymbol b)}{P(\boldsymbol A| \boldsymbol b)}P(\boldsymbol b | \boldsymbol A)
31 32 33 34 35 36 37 38

up to a normalization constant [#prediction_posterior]_. Although the
normalization constant is difficult to obtain in general (since we need
to perform a sum over all possible spurious/missing edges), the
numerator of Eq. :eq:`posterior-missing` can be computed by sampling
partitions from the posterior, and then inserting or deleting edges from
the graph and computing the new likelihood. This means that we can
easily compare alternative predictive hypotheses :math:`\{\delta
39
\boldsymbol A_i\}` via their likelihood ratios
40 41 42

.. math::

43
   \lambda_i = \frac{P(\delta \boldsymbol A_i | \boldsymbol A)}{\sum_j P(\delta \boldsymbol A_j | \boldsymbol A)}
44 45 46

which do not depend on the normalization constant.

47
The values :math:`P(\delta \boldsymbol A | \boldsymbol A, \boldsymbol b)`
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
can be computed with
:meth:`~graph_tool.inference.blockmodel.BlockState.get_edges_prob`. Hence, we can
compute spurious/missing edge probabilities just as if we were
collecting marginal distributions when doing model averaging.

Below is an example for predicting the two following edges in the
football network, using the nested model (for which we need to replace
:math:`\boldsymbol b` by :math:`\{\boldsymbol b_l\}` in the equations
above).

.. testcode:: missing-edges
   :hide:

   import os
   try:
      os.chdir("demos/inference")
   except FileNotFoundError:
       pass

   g = gt.collection.data["football"].copy()
   color = g.new_vp("string", val="#cccccc")
   ecolor = g.new_ep("string", val="#cccccc")
   ewidth = g.new_ep("double", 1)
   e = g.add_edge(101, 102)
   ecolor[e] = "#a40000"
   ewidth[e] = 5
   e = g.add_edge(17, 56)
   ecolor[e] = "#a40000"
   ewidth[e] = 5
   eorder = g.edge_index.copy("int")

   gt.graph_draw(g, pos=g.vp.pos, vertex_color=color,
                 vertex_fill_color=color, edge_color=ecolor,
                 eorder=eorder, edge_pen_width=ewidth,
                 output="football_missing.svg")

.. figure:: football_missing.*
   :align: center
   :width: 350px

   Two non-existing edges in the football network (in red):
   :math:`(101,102)` in the middle, and :math:`(17,56)` in the upper
   right region of the figure.

.. testsetup:: missing-edges

   gt.seed_rng(7)

.. testcode:: missing-edges

   g = gt.collection.data["football"]

   missing_edges = [(101, 102), (17, 56)]
   
   L = 10

   state = gt.minimize_nested_blockmodel_dl(g, deg_corr=True)

   bs = state.get_bs()                     # Get hierarchical partition.
   bs += [np.zeros(1)] * (L - len(bs))     # Augment it to L = 10 with
                                           # single-group levels.

   state = state.copy(bs=bs, sampling=True)

   probs = ([], [])

   def collect_edge_probs(s):
       p1 = s.get_edges_prob([missing_edges[0]], entropy_args=dict(partition_dl=False))
       p2 = s.get_edges_prob([missing_edges[1]], entropy_args=dict(partition_dl=False))
       probs[0].append(p1)
       probs[1].append(p2)

   # Now we collect the probabilities for exactly 100,000 sweeps
   gt.mcmc_equilibrate(state, force_niter=10000, mcmc_args=dict(niter=10),
                       callback=collect_edge_probs)


   def get_avg(p):
      p = np.array(p)
      pmax = p.max()
      p -= pmax
      return pmax + log(exp(p).mean())

   p1 = get_avg(probs[0])
   p2 = get_avg(probs[1])

   p_sum = get_avg([p1, p2]) + log(2)
   
   l1 = p1 - p_sum
   l2 = p2 - p_sum

   print("likelihood-ratio for %s: %g" % (missing_edges[0], exp(l1)))
   print("likelihood-ratio for %s: %g" % (missing_edges[1], exp(l2)))


.. testoutput:: missing-edges

   likelihood-ratio for (101, 102): 0.37...
   likelihood-ratio for (17, 56): 0.62...

From which we can conclude that edge :math:`(17, 56)` is more likely
than :math:`(101, 102)` to be a missing edge.

The prediction using the non-nested model can be performed in an
entirely analogous fashion.