Biological Optimisation as Machine Learning

The biotech field has long been in the business of finding new molecules or mechanisms that achieve improved function compared with what is readily available in nature. Fundamentally, this can be thought of as an optimisation problem in a very large search space. For machine learning scientists, as soon as we hear “optimisation problem” we immediately think “how can we do it better than anyone else with machine learning?”
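In symbols (all of which are generic placeholders rather than notation from any particular paper), the problem is to find $x^\star \in \arg\max_{x \in \mathcal{X}} f(x)$, where $\mathcal{X}$ is the enormous space of candidate molecules or mechanisms and $f$ scores the function we want to improve.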

Bounds on Bounds

This is a TODO that reminds me to eventually write a post on limitations of MI estimators and alternative solutions.

Notes on Notation

This post is a living document of various ideas on mathematical notation that I think are interesting.

Tuple or Set?

I often encounter the situation where I have a collection of mathematical objects whose ordering doesn’t matter. The default is usually to pronounce such a collection a tuple, and it’s common to hear people colloquially refer to any collection as an (n-)tuple even if ordering is irrelevant. For example, if you’re computing a Monte Carlo estimate of an expectation, you typically have a collection of samples and calculate their average, but how should you write this?
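To make the question concrete (the symbols $f$, $x_n$ and $N$ are generic placeholders), the estimator itself is $$ \hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} f(x_n), \qquad x_n \overset{\text{i.i.d.}}{\sim} p(x), $$ and the samples could be written as the tuple $(x_1, \dots, x_N)$, even though $\hat{\mu}$ is invariant to reordering them, or as the set $\{x_1, \dots, x_N\}$, at the risk of implying that duplicate samples collapse into a single element.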

Upper and Lower Mutual Information Bounds

Recently, Foster et al (2020) introduced a couple of tractable bounds that can be used to estimate the expected information gain (EIG) of a design policy (a mapping from past designs and observations to the next design). These include the sequential Prior Contrastive Estimate (sPCE) lower bound:

$$ \mathcal{L}_T(\pi) = \mathbb{E}_{p(\theta_0)\, p(h_T | \theta_0, \pi)\, \prod_{l=1}^{L} p(\theta_l)} \left[ \log \frac{p(h_T | \theta_0, \pi)}{\frac{1}{L+1} \sum_{l=0}^{L} p(h_T | \theta_l, \pi)} \right] $$

where $h_T = (d_0, y_0, \dots, d_T, y_T)$ is the experimental history at time $T$, $\pi$ is the design policy, $L$ is the number of contrastive samples $\theta_{1:L}$ drawn independently from the prior, and $\theta$ parameterises the experimental model $p(y|\theta, d)$ that maps designs to a distribution over outcomes.
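As a rough illustration of how this estimator comes together in practice, here is a minimal sketch that assumes the policy rollouts are already done and the history log-likelihoods have been computed both under the generating parameter $\theta_0$ and under $L$ independent prior draws. The function name and tensor layout are my own, not from Foster et al:

```python
import math
import torch

def spce_estimate(logp_primary, logp_contrastive):
    """Monte Carlo estimate of the sPCE lower bound.

    Args:
        logp_primary: shape (B,) tensor of log p(h_T | theta_0, pi),
            evaluated at the theta_0 that generated each of B histories.
        logp_contrastive: shape (B, L) tensor of log p(h_T | theta_l, pi)
            for L contrastive parameters drawn i.i.d. from the prior.
    """
    # Stack theta_0 together with the L contrastive draws so the
    # denominator runs over l = 0, ..., L as in the bound.
    logp_all = torch.cat([logp_primary.unsqueeze(-1), logp_contrastive], dim=-1)
    # log of (1/(L+1)) * sum_l p(h_T | theta_l, pi), computed stably.
    log_denom = torch.logsumexp(logp_all, dim=-1) - math.log(logp_all.shape[-1])
    # Average the log-ratio over the B independent rollouts.
    return (logp_primary - log_denom).mean()
```

One property worth keeping in mind: because the numerator term also appears in the denominator sum, the log-ratio, and hence the bound, can never exceed $\log(L+1)$, so the estimate saturates for policies with large EIG unless $L$ is increased.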