Upper and Lower Mutual Information Bounds

Recently, Foster et al. (2021) introduced a couple of tractable bounds that can be used to estimate the expected information gain (EIG) of a design policy (a mapping from past designs and observations to the next design). These include the sequential Prior Contrastive Estimation (sPCE) lower bound:

$$\mathcal{L}_T(\pi) = \mathbb{E}\left[\log \frac{p(h_T \mid \theta_0, \pi)}{\frac{1}{L+1} \sum_{l=0}^{L} p(h_T \mid \theta_l, \pi)}\right]$$

where $h_T = (d_0, y_0, \ldots, d_T, y_T)$ is the experimental history at time $T$, $\pi$ is the design policy, $L$ is the number of contrastive samples, and $\theta$ parameterises the experimental model $p(y \mid \theta, d)$ that maps designs to a distribution over outcomes. The expectation in the sPCE is with respect to the distributions $\theta_0, h_T \sim p(\theta) \prod_{t=1}^{T} p(y_t \mid \theta, \pi(h_{t-1}))$ and $\theta_{1:L} \sim p(\theta)$, where $p(\theta)$ is some prior.
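To make this concrete, here is a minimal sketch of a Monte Carlo sPCE estimate. Everything about the model is a hypothetical stand-in chosen for illustration: a standard normal prior, a toy Gaussian likelihood $y_t \sim \mathcal{N}(\theta d_t, \sigma^2)$, and a constant design $d_t = 1$ playing the role of $\pi(h_{t-1})$.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy setup (an assumption, not from the paper): theta ~ N(0, 1),
# y_t | theta, d_t ~ N(theta * d_t, sigma^2), fixed design d_t = 1.
sigma, T, L, N = 1.0, 5, 127, 2000

def log_lik(y, theta, d):
    """log p(y | theta, d) under the toy Gaussian model."""
    return -0.5 * ((y - theta * d) / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma**2)

terms = np.empty(N)
for n in range(N):
    theta = np.concatenate((rng.standard_normal(1),   # theta_0, generates h_T
                            rng.standard_normal(L)))  # contrastive theta_{1:L}
    log_p_hT = np.zeros(L + 1)                        # log p(h_T | theta_l, pi)
    for t in range(T):
        d = 1.0                                       # stand-in for pi(h_{t-1})
        y = theta[0] * d + sigma * rng.standard_normal()
        log_p_hT += log_lik(y, theta, d)
    # log p(h_T | theta_0, pi) - log( (1/(L+1)) sum_l p(h_T | theta_l, pi) )
    terms[n] = log_p_hT[0] - (logsumexp(log_p_hT) - np.log(L + 1))

print(f"sPCE ~= {terms.mean():.3f}  (ceiling log(L+1) = {np.log(L + 1):.3f})")
```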

We also have the corresponding sequential Nested Monte Carlo (sNMC) upper bound:

$$\mathcal{U}_T(\pi) = \mathbb{E}\left[\log \frac{p(h_T \mid \theta_0, \pi)}{\frac{1}{L} \sum_{l=1}^{L} p(h_T \mid \theta_l, \pi)}\right]$$

where the only difference is that we don't include $l = 0$ in the denominator. Since the objective is usually to maximise EIG, the intuitive approach would be to maximise the lower bound (and hope that it actually drives the EIG up, rather than just tightening the bound). However, as is typical for mutual information lower bounds, the sPCE is upper-bounded: $\mathcal{L}_T(\pi) \le \log(L+1)$, which you can verify for yourself by noting that $p(h_T \mid \theta_0, \pi)$ appears in both the numerator and the denominator, and remembering that probabilities are always non-negative, so the ratio inside the log is at most $L+1$. That means if the actual EIG is large, we would need an exponentially large $L$ to estimate it accurately: for an EIG of, say, 10 nats, the ceiling only clears the true value once $L + 1 \ge e^{10} \approx 2.2 \times 10^4$.
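The one-line difference between the two estimators is easy to see in code. Below is a sketch of a helper (the name and interface are my own, not from the paper) that computes one Monte Carlo term of each bound from the same vector of log-likelihoods, such as the `log_p_hT` built in the earlier snippet:

```python
import numpy as np
from scipy.special import logsumexp

def spce_and_snmc_terms(log_p_hT):
    """One Monte Carlo term of each bound, given log p(h_T | theta_l, pi)
    for l = 0..L, where theta_0 is the sample that generated h_T."""
    L = len(log_p_hT) - 1
    spce = log_p_hT[0] - (logsumexp(log_p_hT) - np.log(L + 1))   # l = 0 included
    snmc = log_p_hT[0] - (logsumexp(log_p_hT[1:]) - np.log(L))   # l = 0 excluded
    return spce, snmc
```

Because the sPCE denominator contains the numerator as one of its $L+1$ terms, `spce` can never exceed $\log(L+1)$; `snmc` has no such ceiling.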

This might lead you to wonder if we could use the upper bound instead, which does not suffer the same limitation. Intuitively it doesn’t make much sense to maximise a quantity by maximising an upper bound, because we might just be making the bound arbitrarily loose. But it turns out there is a monotonicity property connecting the sPCE and sNMC.

Let $\pi_1, \pi_2$ be two design policies, and for simplicity we will write $\mathcal{L}_1, \mathcal{L}_2$ and $\mathcal{U}_1, \mathcal{U}_2$ to denote the lower and upper bounds of those policies, estimated with the same (arbitrary) $L$ and $T$.

Theorem 1. The sPCE lower bound and sNMC upper bound induce the same total order on policy space:

$$\mathcal{U}_2 > \mathcal{U}_1 \iff \mathcal{L}_2 > \mathcal{L}_1 \qquad \forall \pi_1, \pi_2 \in \Pi$$

$$\mathcal{U}_2 = \mathcal{U}_1 \iff \mathcal{L}_2 = \mathcal{L}_1 \qquad \forall \pi_1, \pi_2 \in \Pi$$

Proof. To simplify the following analysis, we introduce the notation:

$$a(\pi) = p(h_T \mid \theta_0, \pi), \qquad c(L, \pi) = \sum_{l=1}^{L} p(h_T \mid \theta_l, \pi)$$

This lets us rewrite the bounds as follows, where we omit the dependency of $a, c$ on $L$ and $\pi$ for brevity:

$$\mathcal{L}(L, \pi) = \log \frac{(L+1)\,a}{a + c}, \qquad \mathcal{U}(L, \pi) = \log \frac{L\,a}{c}$$

Now take any two policies $\pi_1$ and $\pi_2$ and write

$$a(\pi_2) = b\, a(\pi_1), \qquad c(L, \pi_2) = d\, c(L, \pi_1)$$

for scalars $b, d$; such scalars always exist and are positive, since the likelihoods are positive. For convenience, we will denote $a_i = a(\pi_i)$ and $c_i = c(L, \pi_i)$. The differences between the bounds for the two policies are:

$$\mathcal{U}_2 - \mathcal{U}_1 = \log \frac{L b a_1}{d c_1} - \log \frac{L a_1}{c_1} = \log \left( \frac{L b a_1}{d c_1} \cdot \frac{c_1}{L a_1} \right) = \log \frac{b}{d}$$

$$\mathcal{L}_2 - \mathcal{L}_1 = \log \frac{(L+1) b a_1}{b a_1 + d c_1} - \log \frac{(L+1) a_1}{a_1 + c_1} = \log \left( \frac{(L+1) b a_1}{b a_1 + d c_1} \cdot \frac{a_1 + c_1}{(L+1) a_1} \right) = \log \frac{b (a_1 + c_1)}{b a_1 + d c_1}$$

Now note that we can rewrite the expression for $\mathcal{U}_1$ to get an expression for $a_1$ in terms of the upper bound:

$$\mathcal{U}_1 = \log \frac{L a_1}{c_1} \;\Rightarrow\; e^{\mathcal{U}_1} = \frac{L a_1}{c_1} \;\Rightarrow\; a_1 = \frac{c_1 e^{\mathcal{U}_1}}{L}$$

Plugging this back into the difference between lower bounds gives us:

$$\mathcal{L}_2 - \mathcal{L}_1 = \log \frac{b \left( \frac{1}{L} c_1 e^{\mathcal{U}_1} + c_1 \right)}{b\, \frac{1}{L} c_1 e^{\mathcal{U}_1} + d c_1} = \log \frac{b c_1 \left( \frac{1}{L} e^{\mathcal{U}_1} + 1 \right)}{c_1 \left( \frac{1}{L} b e^{\mathcal{U}_1} + d \right)} = \log \frac{\frac{1}{L} b e^{\mathcal{U}_1} + b}{\frac{1}{L} b e^{\mathcal{U}_1} + d}$$

Now we subtract $\log \frac{d}{d} = 0$ from both sides of the equation:

$$\mathcal{L}_2 - \mathcal{L}_1 - 0 = \log \frac{\frac{1}{L} b e^{\mathcal{U}_1} + b}{\frac{1}{L} b e^{\mathcal{U}_1} + d} - \log \frac{d}{d}$$

$$\mathcal{L}_2 - \mathcal{L}_1 = \log \left( \frac{\frac{1}{L} b e^{\mathcal{U}_1} + b}{\frac{1}{L} b e^{\mathcal{U}_1} + d} \cdot \frac{\frac{1}{d}}{\frac{1}{d}} \right)$$

$$\mathcal{L}_2 - \mathcal{L}_1 = \log \frac{\frac{1}{L} \frac{b}{d} e^{\mathcal{U}_1} + \frac{b}{d}}{\frac{1}{L} \frac{b}{d} e^{\mathcal{U}_1} + 1}$$

Looking at the difference between upper bounds, we have that $\frac{b}{d} = e^{\mathcal{U}_2 - \mathcal{U}_1}$, therefore

$$\mathcal{L}_2 - \mathcal{L}_1 = \log \frac{\frac{1}{L} e^{\mathcal{U}_2 - \mathcal{U}_1} e^{\mathcal{U}_1} + e^{\mathcal{U}_2 - \mathcal{U}_1}}{\frac{1}{L} e^{\mathcal{U}_2 - \mathcal{U}_1} e^{\mathcal{U}_1} + 1}$$

$$\mathcal{L}_2 - \mathcal{L}_1 = \log \frac{\frac{1}{L} e^{\mathcal{U}_2} + e^{\mathcal{U}_2 - \mathcal{U}_1}}{\frac{1}{L} e^{\mathcal{U}_2} + 1}$$

which gives us an expression for the difference $\mathcal{L}_2 - \mathcal{L}_1$ purely in terms of the corresponding upper bounds.

Now, we note that if $\mathcal{U}_2 > \mathcal{U}_1$, then $e^{\mathcal{U}_2 - \mathcal{U}_1} > 1$ and therefore

$$\mathcal{L}_2 - \mathcal{L}_1 = \log \frac{\frac{1}{L} e^{\mathcal{U}_2} + e^{\mathcal{U}_2 - \mathcal{U}_1}}{\frac{1}{L} e^{\mathcal{U}_2} + 1} > 0$$

which implies that $\mathcal{L}_2 > \mathcal{L}_1$. The procedure can be reversed to show that $\mathcal{L}_2 > \mathcal{L}_1 \Rightarrow \mathcal{U}_2 > \mathcal{U}_1$, thus proving the first claim of the theorem.

Similarly, we can note that if $\mathcal{U}_2 = \mathcal{U}_1$ then $e^{\mathcal{U}_2 - \mathcal{U}_1} = 1$, which by the same procedure as before shows that $\mathcal{U}_2 = \mathcal{U}_1 \Rightarrow \mathcal{L}_2 = \mathcal{L}_1$; again, we can reverse the procedure to show that $\mathcal{L}_2 = \mathcal{L}_1 \Rightarrow \mathcal{U}_2 = \mathcal{U}_1$, thus proving the second claim of the theorem.

Q.E.D.
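Since the final identity is purely algebraic, we can sanity-check it numerically. The sketch below picks arbitrary positive values for $a_1, c_1, b, d$, computes both bounds for both "policies" directly, and verifies the derived relationship, along with the monotone map obtained by substituting $a = c\,e^{\mathcal{U}}/L$ into the lower bound (a direct consequence of the same algebra, not a claim from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 10

# Stand-ins for a(pi) = p(h_T | theta_0, pi) and c(L, pi) = sum_l p(h_T | theta_l, pi)
a1, c1 = rng.uniform(0.1, 1.0, size=2)
b, d = rng.uniform(0.5, 2.0, size=2)   # a2 = b * a1, c2 = d * c1
a2, c2 = b * a1, d * c1

lower = lambda a, c: np.log((L + 1) * a / (a + c))   # sPCE term
upper = lambda a, c: np.log(L * a / c)               # sNMC term

L1, L2 = lower(a1, c1), lower(a2, c2)
U1, U2 = upper(a1, c1), upper(a2, c2)

# Identity from the proof: L2 - L1 depends only on U1, U2 (and L)
pred = np.log((np.exp(U2) / L + np.exp(U2 - U1)) / (np.exp(U2) / L + 1))
assert np.isclose(L2 - L1, pred)

# Monotone map from upper to lower bound, via a = c * e^U / L
assert np.isclose(L1, np.log((L + 1) * np.exp(U1) / (np.exp(U1) + L)))
print("both identities hold")
```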

So what does this all mean? In theory, we can maximise the sNMC upper bound, and thereby indirectly maximise the sPCE lower bound, which we hope will drive up the actual mutual information. In fact, the two optimisations are in some sense equivalent, since increasing either bound must increase the other by a predictable, deterministic amount. In practice, my experience is that both approaches perform equally well.

References:

Foster, A., Ivanova, D. R., Malik, I., & Rainforth, T. (2021). "Deep Adaptive Design: Amortizing Sequential Bayesian Experimental Design." International Conference on Machine Learning. PMLR.