A bit of background: Imagine we have a sample S of 10000 participants from a larger, 40000 total population of individuals. For the 10000 sample we have some variables that we're interested in modeling, say the relation between height and diet. We also know that there is some spatial non-independence between the participants that we need to "control for" (let's say geo location with coordinates). We can do this by building a model with a GP for spatial non-independence and then model whatever variables we're interested in. Now, the issue is, we later determine that we also want to study different variables like the relation between amount of exercise and hair color. We now need to find participants in S. We only have access to S, other individuals of the population are unreachable. We then need to annotate this sub-sample T and annotate them for these two variables. However, annotation is very costly, and we can realistically only annotate some 100-200 participants.…