Opinion: Statistical Misconceptions

Researchers must be wary of the common mistakes of correlation analysis when drawing conclusions about the nature of their data.

By | July 31, 2013

Statistics are the basis of scientific data analysis, and with the flood of data coming from new genomics technologies, biostatistics has truly become an inseparable part of modern science. Nevertheless, a fundamental statistical technique, correlation analysis, which measures the relationship between two variables, is often employed incorrectly, leading to erroneous conclusions about the true nature of the relationship between the studied phenomena.

The primary task of correlation analysis is to test for a relationship, or agreement, between two variables of interest, such as smoking and the incidence of lung cancer. Provided that the study was carried out on a sufficiently large sample, the degree of correlation between the observed phenomena can also be roughly assessed and quantified as the linear correlation coefficient.

This coefficient must then be interpreted and critically analyzed, because correlation analysis does not aim to explain the nature of the quantitative agreement; in other words, it says nothing about a causal relationship between the two variables. In addition to assuming causality, researchers commonly fall victim to two other misconceptions: inferring the nature of the individual from group findings, and thinking that a correlation of zero implies independence. Each of these errors can lead to invalid conclusions.

Misconception #1: Correlation implies causality

Every scientist knows that “correlation does not imply causation.” Indeed, two variables may show the same quantitative trend purely by coincidence, without any logical or natural relationship between them at all. Alternatively, two variables may trend together because both are influenced by the same confounding factor. Nevertheless, the inappropriate assumption of causality remains the biggest source of error in interpreting the results of correlation analysis.
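The confounding scenario can be illustrated with a small simulation. The sketch below is only an illustration on synthetic data: the variable names (`confounder`, `x`, `y`), the noise levels, and the sample size are all assumptions made for the example, and the first-order partial correlation is just one simple way of adjusting for a single, measured confounder.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(xs, ys, zs):
    """Correlation of x and y after linearly adjusting both for z."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical hidden factor that drives both observed variables.
confounder = [rng.gauss(0, 1) for _ in range(n)]

# x and y each depend on the confounder plus independent noise;
# neither one has any effect on the other.
x = [z + rng.gauss(0, 0.5) for z in confounder]
y = [z + rng.gauss(0, 0.5) for z in confounder]

r_raw = pearson_r(x, y)
r_adjusted = partial_r(x, y, confounder)

print(f"raw correlation between x and y:       {r_raw:.2f}")
print(f"correlation adjusted for the confounder: {r_adjusted:.2f}")
```

In this setup the raw correlation is strong even though neither variable influences the other, and it largely disappears once the shared factor is accounted for. In real studies the confounder is often unmeasured, which is why such an adjustment cannot replace careful study design.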

In 2008, for example, the Journal of Pediatrics published a study in which the authors concluded that eating breakfast can solve the problem of teenage obesity, based simply on the fact that teenagers who eat breakfast are less likely to be obese. Although the correlation found by the authors is compatible with causality, it is unlikely that eating breakfast alone can solve the problem of teenage obesity. More likely, there is a common cause behind the two phenomena (eating breakfast and teenage obesity), such as poverty, and no direct relationship between them.

Similar examples of authors misinterpreting the correlation coefficient are common in the epidemiological literature. One group of researchers, for example, found a correlation between women taking combined hormone replacement therapy (HRT) and a lower-than-average incidence of coronary heart disease (CHD), and concluded that HRT lowered the risk of CHD. However, randomized controlled trials found the contrary: HRT caused an increase in the risk of CHD. It was later determined that the lower-than-average incidence of CHD was due to the benefits associated with the higher average socioeconomic status of the women taking HRT, not to the therapy itself.

Studies including this type of error are published even in leading biomedical journals. For example, a 1999 Nature study found a strong association between myopia, or near-sightedness, and night-time ambient light exposure during sleep in children. The authors concluded that it seems prudent that infants and young children sleep at night without artificial lighting in the bedroom. A later study refuted these findings and reported that, in this case, the cause of myopia was genetic, not environmental, as many of the study participants’ parents also suffered from the condition.

Of course, the fact that “correlation does not imply causation” should not lead to the diametrically opposite conclusion that a correlation can never point to causality. Correlations, especially those with a high linear correlation coefficient, may indicate a causal relationship, but establishing one requires systematic examination.

Misconception #2: Individuals follow the group

It is not always possible to make inferences about the nature of individuals from information about the group to which those individuals belong. Many researchers do make such assumptions, however, thereby falling victim to the ecological inference fallacy.

One example of the ecological inference fallacy is a 2012 paper in the New England Journal of Medicine: the study author found a close and significant linear correlation between chocolate consumption per capita and the number of Nobel laureates per 10 million persons across 23 countries. On the basis of this finding, he concluded that chocolate consumption enhances cognitive function and closely correlates with the number of Nobel laureates in each country. But without accurate data at the individual level, it is impossible to draw such a conclusion. For example, it was not known whether, or how much, the Nobel laureates themselves consumed chocolate.
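A small simulation shows how far group-level and individual-level correlations can diverge. The sketch below uses purely synthetic data and has nothing to do with the actual chocolate figures: the ten hypothetical groups, the group size, and the noise levels are all assumptions chosen to make the effect visible.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)  # fixed seed so the sketch is reproducible

group_mean_x = []
group_mean_y = []
within_group_r = []

# Ten hypothetical groups (think "countries"), 200 individuals each.
for level in range(10):
    deviations = [rng.gauss(0, 1) for _ in range(200)]
    # Both variables rise with the group's level, but inside a group an
    # individual with a higher x tends to have a LOWER y.
    xs = [level + d for d in deviations]
    ys = [level - d + rng.gauss(0, 0.5) for d in deviations]

    group_mean_x.append(sum(xs) / len(xs))
    group_mean_y.append(sum(ys) / len(ys))
    within_group_r.append(pearson_r(xs, ys))

r_between_groups = pearson_r(group_mean_x, group_mean_y)

print(f"correlation of group averages: {r_between_groups:.2f}")
print("correlation inside each group: "
      + ", ".join(f"{r:.2f}" for r in within_group_r))
```

Here the group averages are almost perfectly positively correlated, while the relationship among individuals inside every group is negative. Aggregated data alone cannot distinguish this situation from one in which individuals follow the group trend.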

Misconception #3: A correlation of zero implies independence

Based on the previous two examples, it is clear that a high value of the linear correlation coefficient is not by itself sufficient to draw conclusions about the relationship between the variables. Conversely, a correlation coefficient of zero does not mean that the variables are independent, because the correlation coefficient measures linear association only. A U-shaped, non-monotonic relationship, such as the dose-response relationship in steroid hormone receptor-mediated gene expression, may have a correlation of zero.
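The U-shaped case is easy to check numerically. The following sketch uses a made-up, perfectly symmetric quadratic relationship; the names `dose` and `response` and the values are illustrative assumptions, not real hormone data.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical symmetric U-shaped relationship: response = dose squared.
dose = list(range(-5, 6))
response = [d ** 2 for d in dose]

# The response is completely determined by the dose, yet the linear
# correlation vanishes because the falling and rising arms cancel out.
r_linear = pearson_r(dose, response)

# Correlating against the squared dose recovers the dependence.
r_quadratic = pearson_r([d ** 2 for d in dose], response)

print(f"correlation of dose with response:         {r_linear:.2f}")
print(f"correlation of squared dose with response: {r_quadratic:.2f}")
```

The two variables are as dependent as they can be, but the linear coefficient does not detect it. Plotting the data before computing a correlation is usually the simplest safeguard against this error.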

Conclusion

Proper, clear, and correct use of biostatistical methods requires not only adequate knowledge of biostatistics but also continuing education in the field. For that reason, biostatisticians trained in these methods should be involved in the research from the very beginning, not only after the measurements, observations, or experiments are completed.

Vladica M. Veličković is an assistant in the department of public health at the University of Niš in the Republic of Serbia.

Comments

Eric J. Murphy

Posts: 20

July 31, 2013

Spot on. Too much poor science is done where correlation analysis is the last standing means by which to come up with some kind of story. Yet it is quite common for individuals in the science community to then infer cause and effect, leading to a voluminous media blitz and a resulting complete misunderstanding by the public. We have seen this and will continue to see this be done.

Nonetheless, this is a very nicely written piece speaking to a very important issue.  Statistical analysis in general is generally poorly done by a lot of scientists, so I will add this teaching point to the responsible publishing part of my ethics course.

tomm1

Posts: 1

August 3, 2013

From the above: "...A U-shaped, non-monotonic relationship, for example, may have a correlation of zero,..."

The problem of non-linear data can be corrected by using log scales before running the correlation.

Keith Nordstrom

Posts: 1

August 6, 2013

Misconception #2 can also be supported mathematically (and *thank* you for posting it, I've had so many people give me blank stares when I've said it).  A population study always comes up with a Bell Curve or Gaussian curve for one reason: it can be shown from first principles that any set of uncorrelated random variables - governed by any (bounded variance) distribution - will produce a Gaussian curve when combined.

In other words, if you get a Bell Curve in your Psych study, it only means that people weren't cheating.  It arises simply because each person, treated as a random point on some personal distribution (because our answers to questionnaires are not deterministic) is effectively a random variable.

But anytime you have a symmetry like this, it also means you've lost information.  Because you've studied the population - which has to be a Bell Curve - you cannot know what the distribution of any one individual looks like.  It might be a Bell Curve or it might be something else; and worse, there's no reason why it has to peak where your population curve does, so none of your moments will apply either.