# The counterfactual

The counterfactual is a complex notion in statistics, with an offshoot into philosophy: witness the entry in the Stanford Encyclopedia of Philosophy. Judea Pearl, in his fabulous book Causality (2000:33-34), states:

“(…) asking what percentage Q of subjects who died under treatment would have recovered had they not taken the treatment – will encounter (…) difficulties because none of these subjects was tested under the no-treatment condition. Such difficulties have prompted some statisticians to dismiss counterfactual questions as metaphysical and to advocate the restriction of statistical analysis to only those questions that can be answered by direct tests (…)”.

There is a difference between metaphysics and reasonable issues that are difficult to measure. One problem is that mathematics has tended to ground statistics in probability theory, while the material sciences are also concerned with causality.

A high school textbook question gives a straightforward example of how the counterfactual arises naturally.

The question describes farmers who are hindered by rats eating their crops. They may hire one or two rat catchers. The students are asked to create the formulas, and to calculate and plot various outcomes. The high school discussion stops there, since the learning goal is limited to understanding recursive forms. We can, however, look a bit deeper at the implied counterfactual.

The example starts with 1400 rats, a population that grows by 40% per period. A catcher can catch 400 rats per period. For a single catcher the high school students must construct the formula r[t] = 1.4 * r[t-1] – 400, with r[0] = 1400. What is it for two catchers? Let us call them John and Paul.
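The single-catcher recursion can be sketched in a few lines of Python. This is only an illustration of the textbook formula; the function name `rat_population` and its parameters are my own labels, not part of the textbook.

```python
def rat_population(periods, catch_per_period, start=1400.0, growth=1.4):
    """Iterate r[t] = growth * r[t-1] - catch_per_period, with r[0] = start."""
    r = [start]
    for _ in range(periods):
        # Growth happens first, then the catch is subtracted.
        r.append(growth * r[-1] - catch_per_period)
    return r


# One catcher who removes 400 rats per period, followed for three periods.
one_catcher = rat_population(3, 400)
print([round(value, 1) for value in one_catcher])
```

With one catcher the population keeps growing, because 40% of 1400 already exceeds the 400 rats caught. Other catch levels can be tried by changing `catch_per_period`.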

The high school textbook seems to allow the answer r[t] = 1.4 r[t-1] – 800. From survival analysis, however, we know about the effect of competing causes. It may be that Paul catches a rat that only a few hours later would have been caught by John if Paul hadn’t been earlier, or conversely. They might both stumble on the same rat and both catch it: rather than recording such double catches, they may allot such catches to each in turn. This creates the statistical difference between a catch and a rat, where a catch may indicate only half a rat. There is also the difference between the joint operation of two catchers and the conjoint event of both catching the same rat.

The population of rats at the end of the period would be 1.4 * 1400 = 1960 (which is our first counterfactual) except for the fact that a single catcher diminishes this by 400. Thus the catch rate is f = 400 / 1960 ~ 20% and the survival rate is s = 1560 / 1960 ~ 80%.

Assuming that the catchers are equally effective, the joint rat survival rate is (1 – f)^2 = s^2. With two catchers, at the end of the period there are s^2 * 1960 = 1242 surviving rats. Thus jointly 1960 – 1242 = 718 rats are caught. Assuming independence, each catches 359 rats, which is less than the single result of 400. Overall we would get the table on the left (for rats). That table generates a marginal success rate of 359 of 1960 rats, and this marginal rate apparently is conditional on the presence of another catcher.

If we want to maintain the original marginal catch rate of 400 of 1960 rats, then we get the table on the right (for catches). It follows that 82 rats would be caught by Paul and John conjointly. The latter is a pure counterfactual, since it would be hard to determine which rat is caught by the one that otherwise would have been caught by the other. (Marking a rat and releasing it again might be an option, but this assumes no effect on its behaviour, like going into hiding.)

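| | Rats: John | Rats: not-John | Rats: Total | Catches: John | Catches: not-John | Catches: Total |
|---|---|---|---|---|---|---|
| Paul | 0 | 359 | 359 | 82 | 318 | 400 |
| not-Paul | 359 | 1242 | 1601 | 318 | 1242 | 1560 |
| Total | 359 | 1601 | 1960 | 400 | 1560 | 1960 |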

This number of 82 thus is the counterfactual that comes about straightforwardly in a fairly simple case.
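The arithmetic behind both tables can be checked with a short Python sketch. It assumes, as in the text, two equally effective and independent catchers; the variable names are mine and purely illustrative.

```python
potential = 1.4 * 1400          # 1960 rats if nobody catches anything
f = 400 / potential             # catch rate of a single catcher, about 20%
s = 1 - f                       # survival rate under a single catcher, about 80%

# Table for rats: a rat survives only if it escapes both catchers.
survivors = s ** 2 * potential
caught = potential - survivors
each = caught / 2               # every rat is credited to exactly one catcher

# Table for catches: each catcher keeps the marginal rate f.
conjoint = f ** 2 * potential   # would be caught by both John and Paul
only_one = f * s * potential    # would be caught by one of them but not the other

print(round(survivors), round(caught), round(each))
print(round(conjoint), round(only_one))
```

Note that `each` equals `only_one + conjoint / 2`: allotting the conjoint catches to each catcher in turn is what turns the 400 catches of the right-hand table into the 359 rats of the left-hand table.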

That counterfactuals exist should not be a problem for statistics, epidemiology and philosophy. Once you start modelling, counterfactuals pop up by implication. The problem is only that some issues are difficult to measure.

In this rat case it would be smart to assign John and Paul different areas so that they can avoid getting in each other’s way. This is common sense and could be assumed in the textbook question. This also assumes that the catch rate depends upon density and that the density doesn’t differ per assigned area, and so on.

Issues become more complex when epidemiology considers different causes of death (other than John and Paul). Whoever dies from a heart attack can no longer die from cancer. In that case the observations provide us with the table on the left, while the table on the right is a figment of our imagination. It is amazing how much can still be said empirically, however. At some point, though, the general cause of “old age” takes over, and statistics may become polluted when one still tries to identify a single cause.

Counterfactuals might have a bad name. “If the moon were made of green cheese then the trees would grow to heaven” is the common counterexample to hypothetical arguments. We are here in the realm of literature. This attitude isn’t reasonable for science, however, when the questions concern real issues.

Overall it might be wiser to look first at an argument itself and worry less about the implied counterfactual. A focus on the counterfactual might induce the idea that it isn’t relevant since it isn’t factual or part of reality, but such an attitude destroys the very process of argumentation.

I came to write this because of that textbook question and reading Pearl, and wondering why such issues aren’t discussed accessibly in high school. Students would learn more than just constructing recursive formulas.

For this weblog I may add that the argument “If the world would boycott Holland …” had better be judged on its merits, so that the present counterfactual has a larger chance of becoming factual.

PS. 1

Let us recover the hidden death and survival rates from the data, assuming independence. We assume one cause C with death rate f and other causes OC with death rate g. The conjoint rate, the analogue of the conjoint catch above, is f g. For cause C, its share in the conjoint cases may be taken as f / (f + g). Formally we have the following tables, with the total population normalized to 1.

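| | Observed: C | Observed: not-C | Observed: Total | Hidden: C | Hidden: not-C | Hidden: Total |
|---|---|---|---|---|---|---|
| OC | 0 | y | y | f g | g (1 – f) | g |
| not-OC | x | (1 – f) (1 – g) | 1 – y | f (1 – g) | (1 – f) (1 – g) | 1 – g |
| Total | x | 1 – x | 1 | f | 1 – f | 1 |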

x = f (1 – g) + f / (f + g) * f g

y = g (1 – f) + g / (f + g) * f g

In a numerical example, let the observations be given as in the table on the left. Then we can solve the equations (1 – f) (1 – g) = 0.64 and y = g (1 – f) + g / (f + g) * f g = 0.17. We find the solution values on the right.

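| | Observed: C | Observed: not-C | Observed: Total | Solution: C | Solution: not-C | Solution: Total |
|---|---|---|---|---|---|---|
| OC | 0 | 0.17 | 0.17 | 0.040 | 0.151 | 0.191 |
| not-OC | 0.19 | 0.64 | 0.83 | 0.169 | 0.640 | 0.809 |
| Total | 0.19 | 0.81 | 1 | 0.209 | 0.791 | 1 |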
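One way to solve these two equations numerically is sketched below in Python. It is not necessarily how the values above were obtained: it substitutes g = 1 – 0.64 / (1 – f) into the equation for y and then uses plain bisection on f, on the assumption that the implied y falls steadily as f rises. The function names are hypothetical.

```python
def implied_y(f, survive):
    """Observed OC share y implied by a rate f, given the joint survival rate."""
    g = 1.0 - survive / (1.0 - f)          # from (1 - f)(1 - g) = survive
    return g * (1.0 - f) + g / (f + g) * f * g


def recover_rates(y_obs, survive, tol=1e-12):
    """Find f and g by bisection on f.

    At f = 0 the implied y equals 1 - survive, and at f = 1 - survive it is 0.
    Assumption: implied_y decreases in f in between.
    """
    lo, hi = 0.0, 1.0 - survive
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if implied_y(mid, survive) > y_obs:
            lo = mid                        # implied y too large, so f must be larger
        else:
            hi = mid
    f = (lo + hi) / 2.0
    g = 1.0 - survive / (1.0 - f)
    return f, g


f, g = recover_rates(y_obs=0.17, survive=0.64)
x = f * (1 - g) + f / (f + g) * f * g       # should reproduce the observed 0.19
print(round(f, 3), round(g, 3), round(x, 3))
```

The last line serves as a consistency check: the recovered f and g should reproduce the observed x as well as the observed y.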

When C and OC are not independent then other tricks are required, which depend upon the case at hand. When such causes are interdependent, like when a general rise of disease reduces immunity and affects the various states, then we would look for deeper causes.

(PM 1. The standard approach in survival analysis has the competing risk model. It is a bit awkward that I cannot quickly point to the possible similarities and differences in the above approach with that standard survival approach, and have to look into this further. PM 2. Let me indicate a study by Mackenbach et al. (1999) on competing risks that aren’t independent.)

PS. 2

Pearl (2009:379) is stern on econometrics: “In almost every one of his recent articles James Heckman stresses the importance of counterfactuals as a necessary component of economic analysis and the hallmark of econometric achievement in the past century. For example, the first paragraph of the HV article reads: “they [policy comparisons] require that the economist construct counterfactuals. Counterfactuals are required to forecast the effects of policies that have been tried in one environment but are proposed to be applied in new environments and to forecast the effects of new policies.” Likewise, in his Sociological Methodology article (2005), Heckman states: “Economists since the time of Haavelmo (1943, 1944) have recognized the need for precise models to construct counterfactuals… The econometric framework is explicit about how counterfactuals are generated and how interventions are assigned…” And yet, despite the proclaimed centrality of counterfactuals in econometric analysis, a curious reader will be hard pressed to identify even one econometric article or textbook in the past 40 years in which counterfactuals or causal effects are formally defined. Needed is a procedure for computing the counterfactual Y(x, u) in a well-posed, fully specified economic model, with X and Y two arbitrary variables in the model. By rejecting Haavelmo’s definition of Y(x, u), based on surgery, Heckman commits econometrics to another decade of division and ambiguity, with two antagonistic camps working in almost total isolation.” Notice that Haavelmo’s paper tended to cause econometricians to replace Tinbergen’s path analysis (advocated by Pearl) with significance testing (see Ziliak & McCloskey 2007). There is still work to be done.