Simpson's paradox


Simpson's paradox, which goes by several names, is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. This result is often encountered in social-science and medical-science statistics and is particularly problematic when frequency data is unduly given causal interpretations. The paradox can be resolved when causal relations are appropriately addressed in the statistical modeling.
Simpson's paradox has been used as an exemplar to illustrate to the non-specialist or public audience the kind of misleading results misapplied statistics can generate. Martin Gardner wrote a popular account of Simpson's paradox in his March 1976 Mathematical Games column in Scientific American.
Edward H. Simpson first described this phenomenon in a technical paper in 1951, but the statisticians Karl Pearson et al., in 1899, and Udny Yule, in 1903, had mentioned similar effects earlier. The name Simpson's paradox was introduced by Colin R. Blyth in 1972.
It is also referred to as Simpson's reversal, Yule–Simpson effect, amalgamation paradox, or reversal paradox.

Examples

UC Berkeley gender bias

One of the best-known examples of Simpson's paradox is a study of gender bias in graduate school admissions at the University of California, Berkeley. The admission figures for the fall of 1973 showed that men applying were more likely than women to be admitted, and the difference was so large that it was unlikely to be due to chance.
However, when examining the individual departments, it appeared that six out of 85 departments were significantly biased against men, whereas only four were significantly biased against women. In fact, the pooled and corrected data showed a "small but statistically significant bias in favor of women". The data from the six largest departments are listed below (the two departments with the most applicants of each gender were A and B for men, and C and E for women):

                 Men: applicants / % admitted    Women: applicants / % admitted
Department A     825 / 62%                       108 / 82%
Department B     560 / 63%                       25 / 68%
Department C     325 / 37%                       593 / 34%
Department D     417 / 33%                       375 / 35%
Department E     191 / 28%                       393 / 24%
Department F     373 / 6%                        341 / 7%
The research paper by Bickel et al. concluded that women tended to apply to competitive departments with low rates of admission even among qualified applicants, whereas men tended to apply to less-competitive departments with high rates of admission among the qualified applicants.
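The reversal can be checked numerically. The sketch below uses the applicant totals from the Bickel et al. figures for the six largest departments, with admitted counts reconstructed approximately from the published admission rates:

```python
# Admission data for the six largest Berkeley departments (fall 1973):
# (applicants, admitted) per gender. Admitted counts are approximate
# reconstructions from the published admission rates.
departments = {
    "A": {"men": (825, 512), "women": (108, 89)},
    "B": {"men": (560, 353), "women": (25, 17)},
    "C": {"men": (325, 120), "women": (593, 202)},
    "D": {"men": (417, 138), "women": (375, 131)},
    "E": {"men": (191, 53), "women": (393, 94)},
    "F": {"men": (373, 22), "women": (341, 24)},
}

def pooled_rate(gender):
    # Aggregate admission rate across all six departments.
    applied = sum(d[gender][0] for d in departments.values())
    admitted = sum(d[gender][1] for d in departments.values())
    return admitted / applied

# Per-department rates: women are admitted at a higher rate in four
# of the six departments ...
for name, d in departments.items():
    men_rate = d["men"][1] / d["men"][0]
    women_rate = d["women"][1] / d["women"][0]
    print(f"{name}: men {men_rate:.0%}, women {women_rate:.0%}")

# ... yet pooling reverses the comparison, because women applied mostly
# to the competitive departments (C, E) with low admission rates.
print(f"pooled: men {pooled_rate('men'):.0%}, women {pooled_rate('women'):.0%}")
```

The pooled rates come out near 45% for men and 30% for women, even though women's per-department rates are higher in most departments.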

Kidney stone treatment

This is a real-life example from a medical study comparing the success rates of two treatments for kidney stones.
The table below shows the success rates and the numbers of cases for treatments involving both small and large kidney stones, where Treatment A includes all open surgical procedures and Treatment B is percutaneous nephrolithotomy. The numbers in parentheses indicate the number of success cases over the total size of the group.

                 Treatment A          Treatment B
Small stones     Group 1              Group 2
                 93% (81/87)          87% (234/270)
Large stones     Group 3              Group 4
                 73% (192/263)        69% (55/80)
Both             78% (273/350)        83% (289/350)

The paradoxical conclusion is that treatment A is more effective when used on small stones, and also when used on large stones, yet treatment B is more effective when considering both sizes at the same time. In this example, the "lurking" variable is the severity of the case, which was not previously known to be important until its effects were included.
Which treatment is considered better is determined by an inequality between two ratios. The reversal of the inequality between the ratios, which creates Simpson's paradox, happens because two effects occur together:
  1. The sizes of the groups, which are combined when the lurking variable is ignored, are very different. Doctors tend to give the severe cases (large stones) the better treatment (A), and the milder cases (small stones) the inferior treatment (B). Therefore, the totals are dominated by groups 3 and 2, and not by the two much smaller groups 1 and 4.
  2. The lurking variable has a large effect on the ratios; i.e., the success rate is more strongly influenced by the severity of the case than by the choice of treatment. Therefore, the group of patients with large stones using treatment A (group 3) does worse than the group with small stones (group 2), even if the latter used the inferior treatment B.
Based on these effects, the paradoxical result is seen to arise by suppression of the causal effect of the severity of the case on successful treatment. The paradoxical result can be rephrased more accurately as follows: When the less effective treatment is applied more frequently to less severe cases, it can appear to be a more effective treatment.
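The arithmetic of the reversal can be verified directly from the group counts commonly reported for this study (81/87, 234/270, 192/263, 55/80):

```python
# Success counts (successes, total) for each treatment and stone size.
treatment_a = {"small": (81, 87), "large": (192, 263)}
treatment_b = {"small": (234, 270), "large": (55, 80)}

def rate(successes, total):
    return successes / total

def combined(groups):
    # Pool both strata before computing the success rate.
    s = sum(g[0] for g in groups.values())
    t = sum(g[1] for g in groups.values())
    return s / t

# Treatment A wins within each stratum ...
assert rate(*treatment_a["small"]) > rate(*treatment_b["small"])  # 93% > 87%
assert rate(*treatment_a["large"]) > rate(*treatment_b["large"])  # 73% > 69%

# ... yet Treatment B wins when the strata are pooled, because A was
# mostly given to the harder, large-stone cases.
assert combined(treatment_b) > combined(treatment_a)              # 83% > 78%
print(f"A combined: {combined(treatment_a):.0%}, "
      f"B combined: {combined(treatment_b):.0%}")
```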

Batting averages

A common example of Simpson's paradox involves the batting averages of players in professional baseball. It is possible for one player to have a higher batting average than another player each year for a number of years, but to have a lower batting average across all of those years. This phenomenon can occur when there are large differences in the number of at bats between the years. Mathematician Ken Ross demonstrated this using the batting average of two baseball players, Derek Jeter and David Justice, during the years 1995 and 1996:
In both 1995 and 1996, Justice had a higher batting average than Jeter did. However, when the two baseball seasons are combined, Jeter shows a higher batting average than Justice. According to Ross, this phenomenon would be observed about once per year among the possible pairs of players.
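The effect is easy to reproduce from the season figures reported by Ross:

```python
# (hits, at-bats) per season, as reported by Ross.
jeter   = {1995: (12, 48),   1996: (183, 582)}
justice = {1995: (104, 411), 1996: (45, 140)}

def avg(hits, at_bats):
    return hits / at_bats

def combined_avg(seasons):
    # Batting average over all seasons pooled together.
    return (sum(h for h, _ in seasons.values())
            / sum(ab for _, ab in seasons.values()))

# Justice leads in each individual season ...
for year in (1995, 1996):
    assert avg(*justice[year]) > avg(*jeter[year])

# ... but Jeter leads when the seasons are combined: most of Jeter's
# at-bats fell in his strong 1996 season, most of Justice's in his
# weaker 1995 season.
assert combined_avg(jeter) > combined_avg(justice)
print(f"Jeter: {combined_avg(jeter):.3f}, Justice: {combined_avg(justice):.3f}")
```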

Racial disparity in the death penalty

This real-life example is taken from Radelet. The data are from twenty Florida counties during 1976–1977.
                  White defendant    Black defendant    Both
White victim      13%                17%                14%
Black victim      0%                 6%                 5%
Both              12%                10%                -

When the data are disaggregated by the race of the victim, black defendants appear more likely to be sentenced to death in each category. However, crimes against white victims carry higher sentencing rates, most victims were white, and most crimes against white victims were committed by white defendants; as a result, the aggregated data indicate that white defendants were more likely to be sentenced to death.

Vector interpretation

Simpson's paradox can also be illustrated using a 2-dimensional vector space. A success rate of p/q (i.e., p successes out of q attempts) can be represented by the vector (q, p), with a slope of p/q. A steeper vector then represents a greater success rate. If two rates p1/q1 and p2/q2 are combined, as in the examples given above, the result can be represented by the sum of the vectors (q1, p1) and (q2, p2), which according to the parallelogram rule is the vector (q1 + q2, p1 + p2), with slope (p1 + p2)/(q1 + q2).
Simpson's paradox says that even if a vector L1 has a smaller slope than another vector B1, and L2 has a smaller slope than B2, the sum L1 + L2 can potentially still have a larger slope than the sum B1 + B2. For this to occur, one of the L vectors must have a greater slope than one of the B vectors (here L2 and B1), and these will generally be longer than the alternatively subscripted vectors, thereby dominating the overall comparison.
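As a sketch, the kidney-stone figures above can be written as (attempts, successes) vectors, whose slopes are the success rates; summing componentwise (the parallelogram rule) reproduces the reversal:

```python
# Each group as a vector (attempts, successes); its slope is the success rate.
def slope(v):
    q, p = v
    return p / q

def add(v, w):
    # Vector sum = pooling the two groups (parallelogram rule).
    return (v[0] + w[0], v[1] + w[1])

# Kidney-stone data: Treatment A is steeper (better) in each stratum.
a_small, a_large = (87, 81), (263, 192)
b_small, b_large = (270, 234), (80, 55)

assert slope(a_small) > slope(b_small)
assert slope(a_large) > slope(b_large)

# Yet B's long, relatively steep small-stone vector dominates its sum,
# so the summed B vector is steeper than the summed A vector.
assert slope(add(b_small, b_large)) > slope(add(a_small, a_large))
```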

Correlation between variables

Simpson's paradox can also arise in correlations, in which two variables appear to have a positive correlation with one another, when in fact they have a negative correlation, the reversal having been brought about by a "lurking" confounder. Berman et al. give an example from economics, where a dataset suggests overall demand is positively correlated with price, in contradiction of expectation. Analysis reveals time to be the confounding variable: plotting both price and demand against time reveals the expected negative correlation over various periods, which then reverses to become positive if the influence of time is ignored by simply plotting demand against price.
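A minimal synthetic illustration (hypothetical numbers, not Berman et al.'s dataset): within each time period demand falls as price rises, but both drift upward over time, so pooling the periods flips the sign of the correlation:

```python
import statistics

def pearson(xs, ys):
    # Pearson correlation coefficient from covariance terms.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: within each period, demand = intercept - 3 * price.
period1_price = [8, 9, 10, 11, 12]
period1_demand = [120 - 3 * p for p in period1_price]   # 96 .. 84
period2_price = [18, 19, 20, 21, 22]                    # later: higher prices
period2_demand = [200 - 3 * p for p in period2_price]   # 146 .. 134

# Perfect negative correlation within each period ...
assert pearson(period1_price, period1_demand) < 0
assert pearson(period2_price, period2_demand) < 0

# ... but positive correlation when time is ignored and data are pooled.
assert pearson(period1_price + period2_price,
               period1_demand + period2_demand) > 0
```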

Implications for decision making

The practical significance of Simpson's paradox surfaces in decision-making situations, where it poses the following dilemma: which data should we consult in choosing an action, the aggregated or the partitioned? In the kidney stone example above, it is clear that if one is diagnosed with small stones or with large stones, the data for the respective subpopulation should be consulted, and Treatment A would be preferred to Treatment B. But what if a patient is not diagnosed and the size of the stone is not known; would it be appropriate to consult the aggregated data and administer Treatment B? This would stand contrary to common sense; a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.
On the other hand, if the partitioned data is to be preferred a priori, what prevents one from partitioning the data into arbitrary sub-categories artificially constructed to yield wrong choices of treatments? Pearl shows that, indeed, in many cases it is the aggregated, not the partitioned data that gives the correct choice of action. Worse yet, given the same table, one should sometimes follow the partitioned and sometimes the aggregated data, depending on the story behind the data, with each story dictating its own choice. Pearl considers this to be the real paradox behind Simpson's reversal.
As to why and how a story, not data, should dictate choices, the answer is that it is the story which encodes the causal relationships among the variables. Once we explicate these relationships and represent them formally, we can test which partition gives the correct treatment preference. For example, if we represent causal relationships in a graph called "causal diagram", we can test whether nodes that represent the proposed partition intercept spurious paths in the diagram. This test, called the "back-door criterion", reduces Simpson's paradox to an exercise in graph theory.
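As a sketch of such an adjustment (not Pearl's full machinery): if stone size satisfies the back-door criterion in the kidney stone example, the causal effect of each treatment is estimated by averaging the stratum-specific success rates weighted by the overall prevalence of each stratum, P(success | do(T)) = sum over z of P(success | T, z) P(z):

```python
# Counts (successes, total) per treatment and stratum, kidney-stone data.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

# P(z): prevalence of each stone size across all 700 patients.
totals = {z: sum(data[t][z][1] for t in data) for z in ("small", "large")}
n = sum(totals.values())
p_z = {z: totals[z] / n for z in totals}

def adjusted_rate(treatment):
    # Back-door adjustment: sum_z P(success | treatment, z) * P(z).
    return sum((data[treatment][z][0] / data[treatment][z][1]) * p_z[z]
               for z in p_z)

# After adjusting for stone size, Treatment A comes out ahead,
# agreeing with the stratified (partitioned) comparison.
assert adjusted_rate("A") > adjusted_rate("B")
print(f"A adjusted: {adjusted_rate('A'):.1%}, "
      f"B adjusted: {adjusted_rate('B'):.1%}")
```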

Psychology

Psychological interest in Simpson's paradox seeks to explain why people deem sign reversal to be impossible at first, offended by the idea that an action preferred both under one condition and under its negation should be rejected when the condition is unknown. The question is where people get this strong intuition from, and how it is encoded in the mind.
Simpson's paradox demonstrates that this intuition cannot be derived from either classical logic or probability calculus alone, and thus led philosophers to speculate that it is supported by an innate causal logic that guides people in reasoning about actions and their consequences. Savage's sure-thing principle is an example of what such logic may entail. A qualified version of Savage's sure thing principle can indeed be derived from Pearl's do-calculus and reads: "An action A that increases the probability of an event B in each subpopulation Ci of C must also increase the probability of B in the population as a whole, provided that the action does not change the distribution of the subpopulations." This suggests that knowledge about actions and consequences is stored in a form resembling Causal Bayesian Networks.

Probability

A paper by Pavlides and Perlman presents a proof, due to Hadjicostas, that in a random 2 × 2 × 2 table with uniform distribution, Simpson's paradox will occur with a probability of exactly 1/60. A study by Kock suggests that the probability that Simpson's paradox would occur at random in path models with two predictors and one criterion variable is approximately 12.8 percent, slightly higher than one occurrence per eight path models.
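The 2 × 2 × 2 result can be checked by simulation: draw the eight cell weights as i.i.d. exponentials (normalizing them gives a uniform point on the simplex, and normalization does not affect the sign comparisons), then count how often both strata agree on the direction of association while the pooled table strictly disagrees. This is a sketch that assumes a reversal in either direction counts as an occurrence of the paradox:

```python
import random

def is_simpson(cells):
    # cells: (a1, b1, c1, d1, a2, b2, c2, d2) for two 2x2 strata, where
    # a/b are successes/failures under one treatment, c/d under the other.
    a1, b1, c1, d1, a2, b2, c2, d2 = cells
    # Direction of association in a 2x2 table via the cross-product:
    # a/(a+b) > c/(c+d)  <=>  a*d > b*c.
    s1 = a1 * d1 - b1 * c1
    s2 = a2 * d2 - b2 * c2
    pooled = (a1 + a2) * (d1 + d2) - (b1 + b2) * (c1 + c2)
    # Paradox: both strata agree, pooled table strictly disagrees.
    return (s1 > 0 and s2 > 0 and pooled < 0) or \
           (s1 < 0 and s2 < 0 and pooled > 0)

def estimate(trials, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cells = [rng.expovariate(1.0) for _ in range(8)]
        hits += is_simpson(cells)
    return hits / trials

# Compare the estimate with Hadjicostas's exact value of 1/60.
print(estimate(100_000))
```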

Simpson's second paradox

A second, less well-known paradox was also discussed in Simpson's 1951 paper. It can occur when the rational interpretation need not be found in the separate table but may instead reside in the combined table. Which form of the data should be used hinges on the background and the process giving rise to the data.
Norton and Divine give a hypothetical example of the second paradox.