How to study studies


There's a lot of cynicism about science these days, fed by every headline that declares one day that some food is good for you and the next that it's bad for you, every report about some researcher faking his experimental results, and every "study" trotted out by some con artist claiming that his new supplement will cure all ills. (Note: throughout this post, I'm going to use "study" and "experiment" interchangeably, though they don't quite mean the same things, and I'll be thinking mostly about biological research, though what I say will apply to greater or lesser degrees to the other branches of science.) It can seem to some like science is not working very well, or worse, that it's fundamentally flawed. And indeed, anyone trying to devise an experiment to discover a truth about the world faces many potential difficulties, including:

  • Flawed design. It's really, really hard to design an experiment that proves anything complex enough to make a real difference in the world.
  • Human error. Even well-trained, well-intentioned scientists can make mistakes in conducting experiments.
  • Randomness. Especially if you're experimenting on living things, you're going to get some degree of randomness. Life is complex, and one frog might not react exactly the same as another frog.
  • Unintentional bias. Until our robot overlords take over, scientific experiments are going to be designed, conducted, and interpreted by human beings with conscious or unconscious biases.
  • Intentional fraud. Yes, it does happen that scientists fake results, whether for self-interested reasons (e.g. to advance their careers or enable some scam) or ideological ones (to support some preexisting view they hold).

Faced with these obstacles, we might doubt that science can prove anything, but what we'd be forgetting is the power of human intelligence. We're a problem solving species, and over the centuries, we've developed a process for deriving truth from the messiness of experimental science.

  1. We design our experiments in accordance with a set of best practices that we've continued to refine over the years. Two good examples of these practices are the requirements that studies involving people be "double-blinded" and "randomized." A more recent example is the requirement that researchers state their objectives ahead of time to avoid the practice of "p-hacking." As we discover new ways that studies can be flawed, we develop new guidelines to help us avoid those pitfalls.
  2. We require scientific papers to be peer reviewed before they can be published. This allows experts in the same field to confirm that the design of the study or experiment was sound and that the researchers properly processed and interpreted the data.
  3. We talk non-judgmentally about the strength of studies. A study that followed all the best practices and involved a large, properly randomized sample would be considered strong, while a study that involved a smaller sample or that had some design flaws might be thought of as weaker. However, "weak" is not necessarily a pejorative. Many studies are intentionally weak, because it's more expensive and time-consuming to conduct a strong study, and in a world of limited resources, we can use weaker studies to probe for new areas that might be worth more rigorous testing later.
  4. We attempt to reproduce experiments. This is such an important point that it bears repeating. We attempt to reproduce experiments! The biggest misunderstanding that non-scientists have about science is that it relies on single studies to prove anything. After a study is published that seems to prove some new thing, other scientists try to repeat the experiment, following its steps precisely and seeing if they get the same results. This helps to address almost all the problems that studies can have: human error, randomness, unconscious bias, and intentional fraud. Every time a study is reproduced successfully, it becomes less likely that the same errors are being made, or that the randomness is breaking the same way every time, or that the new experimenters share the biases or evil intentions of the original team.
  5. Only when a sufficiently strong body of findings clusters convincingly enough around a given point do we consider that point proven. And the key word here is "cluster." It's expected that whatever the truth of a question is, the experiments trying to prove it will produce results with some randomness in them. So, for example, if a given vitamin can, in truth, extend the average human life span by 1%, then we should expect that some weaker studies might even come up with answers below zero (i.e. that this vitamin is harmful) while others might show that it has no benefit or a much bigger benefit. But the critical question is: after we adjust how much weight we put on a given study based on its strength, do all the studies generally cluster around the same point? If so, then that point is probably the right one.
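The weighting-and-clustering idea in step 5 can be sketched with a small simulation. All the numbers here are invented for illustration: suppose a vitamin's true effect is a 1% extension of average life span, and each "study" samples noisy individual outcomes. Small studies scatter widely (some even land below zero), but weighting each study by its strength pulls the pooled estimate toward the truth.

```python
import random
import statistics

random.seed(42)
TRUE_EFFECT = 1.0      # assumed true life-span extension, in percent
SUBJECT_NOISE = 10.0   # assumed std. dev. of individual outcomes, in percent

def run_study(n_subjects):
    """Return (estimated effect, weight) for one simulated study."""
    outcomes = [random.gauss(TRUE_EFFECT, SUBJECT_NOISE)
                for _ in range(n_subjects)]
    estimate = statistics.mean(outcomes)
    # Inverse-variance weight: bigger, tighter studies count for more.
    weight = n_subjects / SUBJECT_NOISE ** 2
    return estimate, weight

# A mix of weak (small) and strong (large) studies.
studies = [run_study(n) for n in [30, 30, 50, 200, 500, 1000]]

for estimate, _ in studies:
    print(f"single-study estimate: {estimate:+.2f}%")

pooled = (sum(e * w for e, w in studies)
          / sum(w for _, w in studies))
print(f"weighted pooled estimate: {pooled:+.2f}%")
```

Running this a few times with different seeds shows the same pattern the post describes: individual results bounce around, but the strength-weighted cluster stays near the true 1%.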

Most of the "problems" we see in science are based on a misunderstanding about how it's supposed to work. We think that experiments are supposed to be self-sufficient proofs unto themselves, when they are really just pieces of a very big and complicated puzzle. In fact, the most serious legitimate problem with science today is that not enough researchers are doing the unglamorous but necessary work of replicating past studies. In an ideal world, science would have the funding, and scientists would have the incentives, to complete the process outlined above for everything we want to know – and then, ironically, there'd be even more cases of apparently conflicting studies that the news media could sensationalize and that naïve readers could take as signs that science was broken. But then, in that ideal world, everyone would also understand that these studies weren't conflicting at all, but merely science homing in on truth.

Meanwhile, the next time you hear about some study that seems to disprove everything that science was saying about the subject just last year, run it through the following gauntlet:

  1. How strong was this study in comparison to the studies that had been done before? If the studies that had been done before were not very strong in the first place, then maybe the question is simply still up in the air, and it's natural to see results going back and forth for a while. In these cases, it's important not to oversell any given result, as if some established law of nature was being overthrown. Real scientists rarely sensationalize their findings like this, but the PR departments of their labs might, and news media headline writers do it all the time.
  2. If the new study is very weak in comparison to the studies that had been done before, then you shouldn't give it much weight. Weak studies have the power to suggest new avenues for more rigorous research, or to bolster existing knowledge, but they can't overthrow strong past findings, unless they reveal some big design flaw in those past studies. (In which case, stronger follow-up studies are still needed before any conclusions can be reached.)
  3. If the design of the contradictory study was strong, has it been reproduced yet? Until it can be reproduced by independent third parties, there's always the possibility of human error, bias, or outright deception.
  4. If a study passes all these tests, then how big was the difference that it found in the first place? Remember randomness and clustering. It's almost guaranteed that whatever the truth is, your results will cluster around that point rather than sit exactly on top of it. So, for example, if the truth is that, say, there's no correlation between cracking your knuckles and getting arthritis later in life, it's almost certain that if enough studies are done on the subject, you'll get some that seem to show that cracking your knuckles does increase your chances of arthritis, as well as others that show it reduces your odds! Taken individually, these studies can lead to confusion and science cynicism, but the key is to take them together. Collectively, they cluster around the true answer: zero correlation.