Data Irreproducibility: The “Waste, Fraud, and Abuse” of Scientific Research
Originally published 2/12/2018
Waste, fraud, and abuse. We’ve heard the phrase a million times. Politicians tell us the key to making things better in our society is to simply eliminate this unholy trio of troubles. They never bother to detail specifically what they’re referring to, for attempting to do so might set their lips ablaze. Their declarations always elicit feelings of déjà vu. Where, precisely, is the waste located, and how much of it is there? Who, exactly, are the specific groups or individuals that have committed fraud? What types of abuse are we talking about? Is the problem really widespread, or limited to just a few cases? If you’ve really identified the root cause of these problems, what’s it going to take to eliminate them? And since this expression is uttered so frequently, why haven’t these problems been fixed by now? Sadly, this meaningless catchphrase is trotted out whenever the speaker has no real solutions to offer but pretends to have deep insights that are never actually enunciated.
Data irreproducibility, which is a real problem, has become the “waste, fraud, and abuse” of scientific research. Its actual frequency is unclear, and critics complain about it without offering workable solutions. There’s no doubt that it slows the progress of science. The reasons that it occurs have been investigated widely, and a few specific causes have been found. Implementing solutions is challenging, though, because it requires people to change how they think and conduct their experiments or clinical trials. A layperson’s guide to the problem is outlined in Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions by NPR reporter Richard Harris (2017). His book is a detailed examination of irreproducible data found in both basic biology and clinical medicine. The problem appears to have gotten much worse in recent years as pressure builds on academic scientists and doctors to make that great discovery that will define and advance their careers.
Suggestions for increasing the replicability of scientific studies have been proposed by a number of well-respected scientists. Unfortunately, little progress has been made in adopting these steps. That’s because at least some of the fundamental causes of the problem are baked into societal cultural norms, and getting those to change is (not surprisingly) incredibly difficult. Training programs in both research and clinical medicine would do well to make the Rigor Mortis book required reading. It might help researchers recognize the inherent biases that are endemic in most scientific disciplines, and learn from the mistakes of their peers.
So what are the key issues?
Bad Stats Are A Big Problem
Some of the problems outlined in the book would appear to be readily solvable with changes in experimental protocols. When I say these problems are fixable, I mean that in an abstract, not practical, sense. Poor use and interpretation of statistics plague many research projects. Most researchers don’t have a good background in statistics, nor do they have ready access to a statistician who can help them. Classes in avoiding data irreproducibility are seldom offered to graduate and medical students, and few take courses in experimental statistics. Grants may not have budgeted funds to pay for hiring statistical help, and finding a willing collaborator skilled in the arts of statistical analysis may take a bit of work. Future grants may be reviewed more favorably if they include a request for funds to ensure that all statistical analyses are up to snuff. One step forward has been the introduction of statistical requirement standards for authors by a number of biomedical journals, along with stricter publishing guidelines.
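One concrete example of the statistical trouble described above is the multiple-comparisons problem: test enough hypotheses at the conventional p < 0.05 threshold and a "significant" result is nearly guaranteed by chance. The arithmetic below is a minimal illustration of that point (the numbers are textbook probability, not figures from Harris's book):

```python
# Illustrative arithmetic: why uncorrected multiple comparisons inflate
# false-positive rates, and how a simple Bonferroni correction restores
# the intended error rate.

def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent null tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the familywise rate near alpha."""
    return alpha / n_tests

# Testing 20 hypotheses at the conventional p < 0.05 threshold:
print(round(familywise_error_rate(0.05, 20), 2))  # 0.64 -- a 64% chance of a spurious "hit"

# With a Bonferroni-corrected threshold of 0.0025 per test:
print(round(familywise_error_rate(bonferroni_alpha(0.05, 20), 20), 2))  # 0.05
```

A statistician on the grant, of course, would know when Bonferroni is too blunt an instrument; the sketch just shows why the naive approach fails.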
Faulty Reagents Doom Many Experiments
This isn’t a problem in fields like psychology, but it’s notorious in biology labs. For example, researchers should budget money to confirm the identity of cells and/or mice used in their experiments. Cell lines are often mixed up in labs; those that grow especially well (like the famous HeLa cell line) eventually outnumber and replace the slower-growing ones the investigators thought they were studying. This problem is not a small one; a recent paper suggested that more than 30,000 scientific articles describe work done using the wrong cells. Researchers should always confirm the identity of the cell line(s) they’re using, but a “see no evil” attitude and ignorance of the problem often prevent this from happening.
Misidentification issues extend beyond cell lines to lab animals. I remember when researcher Brenda Kahan at the University of Wisconsin-Madison sued Charles River breeding labs for selling her genetically impure mice that were not what they were supposed to be. When the problem was discovered, it led to the invalidation of her experimental results, which, according to her lawsuit, lessened the probability that she would be promoted and be able to attract additional grant money.
Commercial reagent suppliers should pay independent researchers to quality-check their products. Companies should guarantee that antibodies truly bind to their intended targets, and that reagents (like growth factors) aren’t contaminated with other proteins. Years ago, experiments were ruined in a number of hematology labs when a commercial source of GM-CSF was (eventually) found to be contaminated with a distinctly different growth factor, erythropoietin.
Outdated software is another significant source of research irreproducibility. Investigators using programs that are years out of date may generate findings that differ substantially from data generated with new and improved software. For those who wrote and maintain the original code, getting scientists to stop using outdated versions is a persistent problem. The message gets sent, but many do not hear it.
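One low-tech mitigation (my suggestion, not something from the book) is to record the exact versions of every analysis package alongside the results, so that a later reader can spot when a stale or updated tool explains a discrepancy. A minimal sketch in Python:

```python
# Record a snapshot of the software environment used for an analysis.
# Package names passed in are whatever the analysis actually used.
import sys
from importlib import metadata

def environment_snapshot(packages):
    """Return a mapping of package name -> installed version string."""
    snapshot = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = "not installed"
    return snapshot

# Save this dictionary next to the results file for every run.
print(environment_snapshot(["numpy", "scipy"]))
```

It won’t force anyone to upgrade, but it at least makes version drift visible when two labs compare notes.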
At Least DNA Sequencing Results Can Be Trusted, Right?
Wrong! In the era of “precision medicine”, oncologists are hoping that gene sequencing of cancer patient tumor samples will help guide their treatment decisions. The idea is to identify mutations in genes that are “actionable”; i.e. those for which there are currently drugs that target that particular site. For example, this could mean inhibiting the enzymatic activity of a mutated protein. DNA sequencing has to be extremely accurate and reproducible for this approach to work. In 2016 researchers sent samples from nine cancer patients to two commercial next-generation sequencing platform companies. It was expected (maybe hoped for is the better term) that the results obtained from the two companies would be identical. That’s not what was found. Of 45 genetic mutations identified, only ten were picked up by both tests. Even worse, the tests often disagreed; for two patients there was no concordance at all between their results. The tests used different methods to sample the DNA. It’ll be difficult for doctors to count on these types of analyses to guide their actions until these issues get fully sorted out.
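The concordance arithmetic above is worth making explicit. In the sketch below, only the totals (45 mutations overall, 10 found by both panels) come from the study; the per-panel counts are illustrative assumptions chosen to match those totals:

```python
# Hypothetical sketch of the concordance arithmetic from the 2016
# two-panel comparison described above.

def concordance(found_by_a: int, found_by_b: int, found_by_both: int) -> float:
    """Fraction of all reported mutations both tests agreed on
    (intersection over union, a Jaccard-style measure)."""
    union = found_by_a + found_by_b - found_by_both
    return found_by_both / union

# Assumed split: panel A reports 25 mutations, panel B reports 30,
# with 10 in common -- a union of 45, as in the study.
print(round(concordance(25, 30, 10), 2))  # 0.22
```

A concordance of 0.22 means the two reports agree on barely a fifth of the mutations either one flags, which is why relying on a single panel to pick an “actionable” target is so fraught.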
Are Results From “Liquid Biopsies” More Accurate?
Sadly, there are reproducibility problems here too. A liquid biopsy is a new technique being developed to identify patients with cancer before they actually have physical symptoms. The idea is to catch the cancers at their earliest stages so that treatment can begin before the cancer spreads, and when it is (theoretically) more curable. It also eliminates the need to do invasive biopsies, which come with significant risks and costs. The samples tested are not prepared directly from the tumors, but from patients’ blood. The goal is to find tumor cells (or shed DNA) in the bloodstream, which can then be analyzed to find evidence of any tumors present. Investigators in a recent study sent serum samples from patients to two different labs to test out the accuracy and reproducibility of their liquid biopsies. Once again, separate labs (both of which were CLIA accredited) running the exact same samples obtained disparate results. These dissimilar findings could lead to distinctly different treatment recommendations. A much larger study provided more encouraging data, but still suffered from issues of false positives as well as false negatives. Read here for an interesting discussion of the potential problems with liquid biopsies that need to be overcome before they’re ready for prime time.
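The false-positive problem with screening tests is worse than intuition suggests, because early-stage cancers are rare in the screened population. The numbers below are illustrative assumptions, not figures from any cited study, but the base-rate arithmetic (Bayes’ rule) is standard:

```python
# Why even an accurate screening test yields many false positives when
# the disease is rare -- the base-rate problem facing liquid biopsies.

def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """Probability that a positive result reflects real disease."""
    true_pos = prevalence * sensitivity          # sick and flagged
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy but flagged
    return true_pos / (true_pos + false_pos)

# Assumed: a 99%-sensitive, 99%-specific test for a cancer present in
# 0.5% of those screened. Two out of three positives are false alarms.
print(round(positive_predictive_value(0.005, 0.99, 0.99), 2))  # 0.33
```

That’s with a test far more accurate than anything currently on offer; real-world sensitivity and specificity would make the picture worse.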
Outright Fraud Is On the Rise
Sadly, examples of papers with fraudulent underpinnings are not difficult to find, and the incidence of these duplicitous articles appears to be increasing. I say appears because it’s not clear if they’re really increasing in number, or if we are simply much better now at detecting fraud using computer programs and other tools. What kinds of problems are we talking about? Plagiarism. Faked photographs. Made-up studies. Non-existent patients. Fudged statistics. Examples can be found for all of these loathsome practices. Misconduct, an analysis by the journal PNAS found, is the leading cause of retracted science papers. Fake journals, where academic publications and credentials are essentially up for sale, are another source of woe for legitimate scientists and the public in general. These journals exist with no real peer review systems in place and make money by charging fees. At the end of the day, though, I worry a little less about these types of truly fraudulent behavior than about studies that were meant to be done properly, but don’t turn out that way. The error-filled papers vastly outnumber those executed in a cloud of mendacity and false achievement claims. Finding their flaws can be much harder than detecting simple plagiarism.
Is Paid 3rd Party Data Confirmation Possible?
Maybe scientific enterprises need to adopt a practice common in other industries: hiring outside firms to challenge their systems and determine just how robust they really are. These tests reveal real-world problems. One widely reported example: the TSA had its security screening protocols tested a few years ago at airports nationwide. Members of the “Red Team” were able to smuggle fake guns and bombs through airport checkpoints an astonishing 95 percent of the time. Despite the supposed implementation of numerous changes following this fiasco, a more recent test of airport security protocols still produced a failure rate of over 70 percent. The additional testing and vetting appears to be mere window dressing that’s made only a small dent in a large problem. If problems aren’t recognized, they can’t be fixed.
So what’s being done to address what is clearly a wide spectrum of reproducibility problems? There’s a new online platform called StudySwap (hosted by the Center for Open Science) that aims to facilitate replication studies between labs. However, its focus appears to be solely on psychology research, a field where previous replication efforts have revealed significant shortcomings. It won’t help those in the hard sciences. Let’s look at the results obtained by the Reproducibility Project: Cancer Biology, an effort to replicate specific experiments. This is a collaboration between the for-profit Science Exchange and the Center for Open Science, and is hosted on the Open Science Framework. The results were published by eLife. The overall results were scored by compiling the detailed descriptions of each replication attempt. Only 29 papers have been analyzed to date (the project started in 2013), and the results from only nine of them have been published. The results indicated that for many of the papers the replication effort “reproduced important parts of the original paper.” However, not all of the data in some of the papers could be replicated, and statistically significant findings in the original paper were not always found to be significant in the repeat experiment. More confounding, in some cases the replication effort “did not reproduce those experiments” it attempted to replicate. In several other cases the “results in this Replication Study could not be interpreted.” My takeaway from all of this: we have a reproducibility problem, which was what led to this work being done in the first place. Or to put it another way: problematic data were shown to be mostly problematic a second time.
Two obvious shortcomings jump out from these results. First, only some of the experiments were subjected to replication efforts for each paper, so not all of the data were examined. The bigger issue was that the number of papers published each year (estimated to be about 2.5 million) dwarfs the scale of these replication efforts by five orders of magnitude. Obviously not all papers need to have their accuracy tested by formal replication studies, but this illustrates just how impractical a “paid replication approach” would be in solving the problems of data irreproducibility.
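The scale gap is easy to check with back-of-the-envelope arithmetic, using the rough figures above:

```python
# Back-of-the-envelope check of the scale gap between publication volume
# and formal replication efforts. Both inputs are rough estimates from
# the text, not precise counts.
import math

papers_per_year = 2_500_000   # estimated papers published annually
papers_examined = 29          # total analyzed by the project since 2013

gap = papers_per_year / papers_examined
print(round(math.log10(gap)))  # 5 -- i.e., five orders of magnitude
```

And that ratio generously compares a single year of publications against the project’s entire multi-year output.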
Many of the suggestions I’ve read that address these problems sound good in theory, but are devilishly hard to put into practice. For example, the Royal Netherlands Academy of Arts and Sciences recommended that journals register scientific reports in advance to “lock in” their protocols and plans. That might happen for clinical research, but I don’t see it working for basic research. Their suggestion that five to ten percent of research funding be spent on replication studies is also a tall order. On the other hand, their proposal that institutions should put a greater emphasis on training scientists in statistical analysis and research design sounds more reasonable. Part of the problem, of course, is that institutions don’t train scientists, scientists do. So independent faculty members are going to need to organize themselves to teach courses on this subject, or develop plans to impart this info within their labs. That, too, is going to be challenging, but it’s potentially doable.
Another approach that’s been taken to battle bad data is to establish a “critical incident” reporting system (LabCIRS) through which individuals can anonymously report problems with experiments or data. This software-based approach has been borrowed from clinical medicine and is scalable. At first glance, and within an ideal population, the concept would appear to have merit. Think of it as a “me, too” movement for calling out suspect science data. The idea is to foster a culture of accountability within a lab group, department, or institution. Implementation, however, is likely to present a number of challenges. As with any anonymous reporting system, the approach seems ripe for abuse as individuals file challenges against data generated by colleagues against whom they have some type of grudge. Who would the arbiter be, and how would they decide what type of response is required for each particular complaint? Finally, how would problems found (e.g. identification of a bad reagent generated by a collaborating lab or commercial supply company) be clearly communicated to the broader research community?
Changing Behaviors is Hard
Getting people to alter their behaviors is maddeningly difficult, even when doing so seems straightforward and requires little effort. Training medical personnel to wash their hands before touching patients is one such example. This is difficult even though nurses and doctors are well aware of how easily lethal bacterial strains circulate in hospitals. Another example: persuading surgeons to adopt operating room checklists to cut down on mistakes. Surgeon and author Atul Gawande details the challenges seen in implementing this process in his excellent book The Checklist Manifesto: How to Get Things Right.
One thing is crystal clear: improving reproducibility in the halls of science is going to raise costs. The reason is simple. Doing experiments multiple times, employing more statisticians (or increasing training), and hiring 3rd party services to check results will cost more in time, reagents, and personnel. Where will this extra money come from? No one’s saying, and in fact President Trump, heading in the opposite direction, wants to stop paying institutional overhead in grants from the NIH.
The “lack of reproducibility” problem is a clear and present danger to our scientific enterprise. However, we need to make sure that efforts to resolve this issue don’t make the system worse by implementing solutions that are poorly vetted. Six years ago I opined that the Reproducibility Initiative was “a good idea in theory that won’t work in practice.” Two years later, I revisited the issue and explored the wide spectrum of problems caused by poorly designed experiments with flawed statistical analyses. Nothing I’ve seen since then has changed my mind. As with “waste, fraud, and abuse”, data irreproducibility in science continues to confound the efforts of those interested in resolving this problem. Coming up with a workable solution isn’t rocket science. It’s much, much harder.
Update 2-16-18: Artificial intelligence also faces a reproducibility problem, much of which is related to a failure of researchers to share their coding algorithms.
Thanks for reading BioPharma Observer! Subscribe for free to receive new posts and support my work.