Thursday, November 14, 2013

But is it causal? Defining causality

As I alluded to in my last post, defining what it means for $X$ to cause $Y$ is no simple task.  It is not an idea that can be defined in purely probabilistic terms, because it says something about the mechanisms underlying the system we are studying, and what will happen if we interfere with that system in some way.

Consider the example given at the end of the last post.  The headline was:
How a short nap can raise the risk of diabetes
The implication of this is that the risk of diabetes increases because of the nap.  But what does this mean?

Let's start with a more clear-cut example.  Suppose I drop a glass bottle onto a concrete floor and it breaks: we might say that the bottle broke because I dropped it.  In doing so we pack in a couple of ideas:

  1. I dropped the bottle, and it broke; 
  2. if I hadn't dropped the bottle, then it wouldn't have broken.
Oh cruel, cruel world. (Photo by LongPHAM, Creative Commons)
The only difference between the two scenarios is whether or not I dropped the bottle, and yet in one the bottle breaks and in the other it doesn't.  Therefore the action of dropping the bottle directly caused it to break.

Unless you're a philosopher all this seems fairly uncontroversial, because we have a very good understanding of the actual mechanism by which the bottle smashes: gravity pulls the bottle down so that it hits the floor, and in doing so the bottle's kinetic energy is transferred to the brittle glass.  We can be pretty confident when we say that if I hadn't dropped the bottle, it wouldn't have broken, even though this is an event we did not observe, because we can model the physics and have a lot of experience to show that bottles are pretty stable when left to their own devices.  If they weren't then the Jesus College wine cellar would be an awful mess.

Now a trickier example.  Suppose that at the end of term I get a cold; I might say that "it's because I've been working hard and not sleeping enough, my immune system is a bit weaker".  If we follow the previous example, then I must mean that "If I hadn't been working so hard, then I wouldn't have caught a cold."  This is far from clear: people sometimes get colds even if they haven't been working hard.

It might well be the case that working hard and not sleeping enough makes it more likely that one will get a cold (let's assume for the sake of argument that it does).  But it's also possible that I would still have become ill even if I'd taken it easier: perhaps I caught it from my housemate, so less work means more time at home potentially exposed to the virus.  On the other hand, I might have caught the cold from a student during a particular meeting, so if I'd had a different meeting at that time I might have worked just as hard and not got a cold.  It therefore seems much less clear what we mean by the idea that me working hard and not sleeping has caused me to become ill.

Probabilistic Ideas of Causality

The important point in the previous example is the idea that working hard makes it more likely that one would get a cold.  We could base a notion of causality on this: say that $A$ is a cause of $B$ if the probability of $B$ occurring is greater when $A$ happens than when $A$ doesn't happen.
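In symbols, a minimal version of this definition reads

$$P(B \mid A) > P(B \mid \text{not } A),$$

though, as we'll see below, conditioning on having passively observed $A$ is not quite the same thing as intervening to bring $A$ about.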

In the case of the bottle, dropping it increased the probability of it breaking from essentially zero (no chance of breaking) to essentially one (will definitely break).  In this sense it is an extreme example.  On the other hand, working hard might increase my chances of getting a cold, but probably only by a very small amount (an increase from 0.1 to 0.15 would be a very high estimate in my view).  In this case, then, there's still a reasonable chance I would have caught a cold anyway, so it seems a very strong statement to say that I became ill because I was working hard.

In a large group of people though, the idea becomes clearer.  Suppose I take 1000 people, and give them all a relaxing couple of months, with a sensible workload.  I might expect approximately 100 of them to get colds; on the other hand, if I make them all work very hard for those two months, my estimate suggests that around 150 will get colds.  This notion of causality says that more people who work hard will get colds than if those same people took it easy.
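This hypothetical comparison can be sketched as a quick simulation.  The probabilities are the invented ones from above: a 0.1 chance of a cold with a sensible workload, rising to 0.15 when working hard.

```python
import random

random.seed(1)

def count_colds(n, p):
    """Simulate n people, each independently catching a cold with probability p."""
    return sum(random.random() < p for _ in range(n))

# Two groups of 1000 people over the same two months (purely illustrative numbers).
easy_colds = count_colds(1000, 0.10)   # sensible workload
hard_colds = count_colds(1000, 0.15)   # working very hard

print(f"colds when taking it easy: {easy_colds}")
print(f"colds when working hard:   {hard_colds}")
```

The counts will fluctuate around 100 and 150 from run to run, which is exactly why this population-level notion needs large groups: for any one individual the difference in risk is far too small to pin down.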

Applied to the Daily Mail's headline, the implication is that if we stopped people from taking naps, then fewer of them would get diabetes.  This is a much stronger statement than simply noting the association between the two.  If, as seems rather more plausible, having an underlying illness makes you more likely to feel tired and therefore to nap, then acting to stop people napping would not have the desired effect at all.  People would just be more tired, with no impact whatever on their chances of developing diabetes.

The notion common to most ideas of probabilistic causality is that the likelihood of an effect, $Y$, is changed when I intervene to change the cause, $X$.  By just passively observing, we might see some association between $X$ and $Y$; but if $X$ causes $Y$, then when I perform an action which somehow changes the value of $X$, it will also change the probability of $Y$ occurring.
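A toy simulation of the nap example makes this distinction concrete.  All of the probabilities below are invented purely for illustration: an underlying illness raises both the chance of napping and the chance of diabetes, while napping itself does nothing.

```python
import random

random.seed(2)
N = 100_000

def person(ban_naps=False):
    """One person under a hypothetical model where illness is a common cause."""
    ill = random.random() < 0.2
    nap = (not ban_naps) and (random.random() < (0.7 if ill else 0.3))
    diabetes = random.random() < (0.15 if ill else 0.05)
    return nap, diabetes

def rate(rows):
    return sum(d for _, d in rows) / len(rows)

# Passive observation: diabetes looks more common among nappers.
sample = [person() for _ in range(N)]
nappers = [r for r in sample if r[0]]
non_nappers = [r for r in sample if not r[0]]
print(f"P(diabetes | nap)    = {rate(nappers):.3f}")
print(f"P(diabetes | no nap) = {rate(non_nappers):.3f}")

# Intervention: ban all naps. The overall diabetes rate is unchanged,
# because napping was never a cause in this model.
banned = [person(ban_naps=True) for _ in range(N)]
print(f"P(diabetes) without ban = {rate(sample):.3f}")
print(f"P(diabetes) with ban    = {rate(banned):.3f}")
```

Observing naps changes our probability of diabetes; intervening on naps does not.  That gap between conditioning and intervening is the whole story.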

Smoking causes lung cancer, in the sense that if everyone stopped smoking we would see less lung cancer.  On the other hand, if I gave everyone a vaccine which stopped people from developing lung cancer, it wouldn't stop people from smoking (perhaps the reverse, in fact!).

Potential Outcomes

The above ideas all refer to probabilities and chances which occur at the population level: I can't say whether or not my specific cold is actually caused by working hard, only that more people have colds when they work hard in a general sense.  So can we define a notion of causality at an individual level?

Imagine that two worlds 'exist': the world we observed, in which I worked hard and then got a cold, and a second counterfactual world which was essentially the same, except that I took it easy.  Let $Y_{\text{hard}}$ and $Y_{\text{easy}}$ be variables which denote whether or not I get sick under each scenario (1 for ill, 0 for not ill): we know $Y_{\text{hard}} = 1$ because I actually did work hard, and I did get a cold.  On the other hand we don't know anything about $Y_{\text{easy}}$: this is an outcome associated with the counterfactual world in which I took it easy, so we can't observe it.

We can consider the pair of values $(Y_{\text{hard}}, Y_{\text{easy}})$ as potential outcomes for an individual, just by assuming that the value of $Y_{\text{easy}}$ exists and is well defined (even though it is not observable, even in principle).  We can use the values of this pair of numbers to divide the population into groups of people based on how they individually respond in the two different worlds.  This idea (due to Donald Rubin) is both powerful and controversial, and deserves a future post of its own; I will say no more now for the sake of brevity, other than that the idea is mathematically very useful, whether you find it completely natural or philosophically challenging.
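As a small illustration of how the pair of potential outcomes carves up the population, here is a sketch (the group labels are my own shorthand, not standard terminology):

```python
# Each person belongs to exactly one of four types, defined by the
# pair of potential outcomes (Y_hard, Y_easy).
response_types = {
    (1, 1): "ill either way",
    (1, 0): "ill only if working hard",
    (0, 1): "ill only if taking it easy",
    (0, 0): "healthy either way",
}

def observed_outcome(y_hard, y_easy, worked_hard):
    """Only one of the two potential outcomes is ever observed."""
    return y_hard if worked_hard else y_easy

# My own case: I worked hard and got a cold, so Y_hard = 1 is observed
# but Y_easy stays hidden -- I could belong to either of two types.
consistent = [t for t in response_types if t[0] == 1]
print(consistent)
```

The fact that each individual only ever reveals one of the two values is exactly why the counterfactual world has to be assumed rather than observed.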

Testing Causal Questions

Now that we've defined some ideas of probabilistic causality, the real question is this: how can we answer causal questions?  As I've already suggested, the 'gold standard' method is to do a randomised trial.  We take a large group of people, and randomly divide them into two groups.  The first group are all treated with $X=1$ and the second with $X=0$, but otherwise they are treated the same; if the number of people with $Y=1$ differs between the two groups (in a statistically and scientifically significant way), then there is a causal effect of $X$ on $Y$ in the sense defined above.
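A minimal sketch of such a trial, with invented outcome probabilities of 0.15 under treatment and 0.10 under control, and a simple two-proportion z-test standing in for the significance check:

```python
import math
import random

random.seed(3)
n = 1000  # people per group

# Randomisation means the groups differ only in X; these outcome
# probabilities are invented for illustration.
treated = sum(random.random() < 0.15 for _ in range(n))  # X = 1
control = sum(random.random() < 0.10 for _ in range(n))  # X = 0

# Two-proportion z-test for a difference in the rate of Y = 1.
p1, p0 = treated / n, control / n
pooled = (treated + control) / (2 * n)
se = math.sqrt(pooled * (1 - pooled) * (2 / n))
z = (p1 - p0) / se

print(f"treated: {treated}/{n}, control: {control}/{n}, z = {z:.2f}")
```

A value of $|z|$ above about 1.96 is the usual threshold for statistical significance at the 5% level; whether the difference is *scientifically* significant is a separate judgement.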

If we can't do this for some reason, whether practical or ethical, then life is rather harder.  Next week I will discuss one popular approach to this, known as instrumental variables.