
On the topic of (ir)reproducible research

An article by Freedman et al.1 from 2015 found that different studies estimate the percentage of irreproducible preclinical research to range from 51% to 89%. They found that the main drivers of irreproducibility are study design, faulty reagents, insufficient protocols, and ill-chosen methods for data analysis.


The introductory text of a collection of articles2 on the reproducibility of research, provided by the journal Nature, says:

Science moves forward by corroboration – when researchers verify others’ results. […] There is growing alarm about results that cannot be reproduced. Explanations include increased levels of scrutiny, complexity of experiments and statistics, and pressures on researchers.
Nature 2

In a survey from 2016, 90% of the participants (n=1576) agreed that there is at least a slight reproducibility crisis, and only 73% thought that at least half of the results in their field can be trusted3.

There are even companies emerging that offer services to facilitate reproducibility by raising quality standards4. They boil down the essence of the previous two findings on their homepage:

Recent publications showed some worrying results about the quality in research and point towards a major issue in the scientific field: The lack of being able to reproduce research data. This issue seems to be so prominent that the phrase “Reproducibility Crisis” came up because 50 -90% of data could not be reproduced. Interestingly, according to a survey published in Nature shows that researchers are aware!
PAASP 4

In this article I will first explain what reproducible research is, then show some reasons why findings are sometimes not reproducible, and finally point to some strategies and values that I think are necessary to achieve reproducible science.

What is reproducible research?


The definition of what reproducible actually means is not standardized among scientists. For some, especially in computational fields, it means that running the same computer code on the same data multiple times will yield the same results. For others, it means that using the same methodology yields the same conclusions5.

So, what do I think reproducible research is? I think there is some merit to both lines of thought. I divide reproducibility into three sub-categories: (1) replicability of the concrete results, (2) replicability of the proposed method, and (3) general replicability of the results with data acquired under the same conditions as described in the publication.

(1) Replicability of concrete results

I think replicability of the concrete results is the minimal standard everybody who publishes a paper should adhere to. Alongside every paper, the raw data as well as the processing and evaluation scripts should be published. In the social sciences, for instance, this could mean that the gathered data (e.g. from questionnaires) as well as the evaluation script (e.g. in R, Stata, or SPSS) are published in the supplemental materials. In other fields of science this could be implemented analogously by publishing the acquired data and evaluation scripts. This has the advantage that it would prompt authors to place value on sound methodology and high data quality. As most scientific research is paid for from government funding, it is only fair to give every citizen the data, the code, and the opportunity to recalculate the results.
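As a minimal sketch of what such a published evaluation script could look like, here is an illustration in Python; the file name responses.csv, its columns, and the two-group comparison are hypothetical placeholders, not taken from any particular study.

```python
# Sketch of a minimal evaluation script that could be published with a paper.
# "responses.csv" and its columns "group" and "score" are hypothetical placeholders.
import pandas as pd
from scipy import stats

data = pd.read_csv("responses.csv")

# Recompute the descriptive statistics reported in the paper.
print(data.groupby("group")["score"].describe())

# Recompute the main inferential result, here a two-sample t-test between groups A and B.
group_a = data.loc[data["group"] == "A", "score"]
group_b = data.loc[data["group"] == "B", "score"]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

If such a script is published together with the raw data, anyone can rerun it and check that the reported numbers actually follow from the data.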

(2) Replicability of method

The next level of reproducibility is that the method needs to be re-implementable just from the information in the paper and the supplemental materials. Depending on the field and the topic, this will require a certain level of effort from the authors or will be limited by the availability of specialized or expensive machines. However, a concise description of all parts and details of the method is a logical step for good science. If the method is easily replicable, other researchers can readily compare their own work to the original publication or incorporate previous work into their own publications. It could be ensured by publishing, for example, measurement protocols and computer code alongside the paper. A good sanity check of whether a method is reproducible is to have a colleague from the lab re-implement the method or, in the best case, replicate the experiment with just the information from the paper in their hands.

(3) General reproducibility

The highest level of reproducibility is when a method proves to work under different conditions and over a long period of time. This can hardly be achieved by publishing a single paper. While on the one hand very thorough validation is important, it is sometimes also important to publish brilliant ideas whose initial results are inconclusive. Rather, it is the job of the scientific community to reproduce the results and to conduct follow-up studies on the original publication. So while (1), the ability to recreate the same results, and (2), the ability to reproduce the method, have to be taken care of by the authors of the publication, I believe that (3), validating general reproducibility, is a community effort of all scientists in the respective field.

What is the issue?


Now, at least adhering to the minimal standard of being able to reproduce the results does not sound too hard. So what is the problem? The three issues discussed in the following sections were among the top ten most frequent reasons for irreproducible research identified in an article by Monya Baker3 from 2016.

Pressure to publish

Like many other things in modern life, science is all about deadlines. Conference contributions need to be written down and submitted at least half a year before the conference takes place. Also, in many cases there are strict timelines for projects financed from publicly funded grants. This can lead to lower quality standards and to authors being uneasy about sharing their code and data with the world.

There is also high pressure, especially on upcoming scientists, to publish original research with excellent positive findings, as it is perceived to be quite hard to publish negative results or replication studies. This leads over to the next problem, which is a strong bias in the reported findings.

Selective reporting

Dwan et al.6 published an article in 2008 in which they performed a meta-analysis of a series of cohort studies and confirmed that there is a correlation between the significance of results and their publication. They found that statistically significant results have higher odds of being reported in full, while positive results are more likely to be published at all. This leads to cases where code and data are not made available, especially when the overall quality is not that high.

There have been efforts to overcome this publication bias. However, journals that focus specifically on negative results have not had great success. In 2014, Elsevier launched a pilot open access journal, New Negatives in Plant Science7, and published an article8 on why publishing negative results is equally important. However, it was discontinued in September 2016. Other journals have suffered the same fate: the Journal of Negative Results in BioMedicine9 ceased publication as of September 2017, the Journal of Negative Results - Ecology & Evolutionary Biology10 seems to have published only three articles in the last three years, and the Journal of Articles in Support of the Null Hypothesis has, as of August 2018, not published an article for over a year.

Low statistical power

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”
Ronald Fisher

As a concrete example for Ronald Fisher's quote, Allison et al.11 list so-called cluster-randomized trials, in which all subjects in a cluster are treated the same. In the statistical evaluation of these trials, the number of clusters must be taken into account, not just the number of subjects. Otherwise, results will sometimes appear to be statistically significant even though they are not.
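To make this concrete, here is a small simulation sketch (my own illustration, not taken from Allison et al.): it generates data for a cluster-randomized trial with no true treatment effect and analyzes it naively at the subject level. The cluster sizes, effect variances, and significance threshold are arbitrary assumptions.

```python
# Sketch: why ignoring the cluster structure inflates false positives
# in a cluster-randomized trial with no true treatment effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_clusters, subjects_per_cluster = 2000, 10, 30
false_positives = 0

for _ in range(n_sims):
    # Every cluster shares a random offset (e.g. a site or litter effect);
    # there is no true treatment effect anywhere in the data.
    cluster_effects = rng.normal(0, 1, n_clusters)
    scores = (np.repeat(cluster_effects, subjects_per_cluster)
              + rng.normal(0, 1, n_clusters * subjects_per_cluster))
    treated = np.repeat(np.arange(n_clusters) % 2 == 0, subjects_per_cluster)

    # Naive analysis: t-test over individual subjects, ignoring the clusters.
    _, p = stats.ttest_ind(scores[treated], scores[~treated])
    false_positives += p < 0.05

# Nominal alpha is 0.05, but the observed rate ends up far higher.
print("False positive rate of the naive analysis:", false_positives / n_sims)
```

Running the analysis on the ten cluster means instead of the 300 individual subjects would bring the false positive rate back down towards the nominal 5%.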

In 2015, Head et al.12 examined The Extent and Consequences of P-Hacking in Science. And while their results suggest that “p-hacking does probably not drastically alter scientific consensuses drawn from meta-analyses”, they also say:

Quantifying p-hacking is important because publication of false positives hinders scientific progress. When false positive results enter the literature they can be very persistent. In many fields, there is little incentive to replicate research.
Head et al. 2015 12
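As a simple illustration of the mechanism behind such false positives (my own sketch, not from Head et al.), the following simulation mimics one common form of p-hacking: testing many outcomes and reporting only the first one that comes out significant. The numbers of outcomes and subjects are arbitrary assumptions.

```python
# Sketch: testing many outcomes and reporting only the "significant" one
# inflates the false positive rate well beyond the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_outcomes, n_per_group = 5000, 10, 20
reported_significant = 0

for _ in range(n_sims):
    for _ in range(n_outcomes):
        # Both groups come from the same distribution,
        # so every "significant" difference is a false positive.
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            reported_significant += 1
            break

# With 10 independent outcomes, the chance of at least one p < 0.05 is roughly
# 1 - 0.95**10, i.e. about 40%.
print("Fraction of 'studies' reporting a significant finding:",
      reported_significant / n_sims)
```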

Conclusion

There seems to be a problem in the scientific community with research not being reproducible for a variety of reasons. The pressure to publish and the bias towards publishing positive results have been discussed as potential reasons for this crisis. I think it might not be the best idea to put a big effort into making it more acceptable to publish negative or inconclusive results in peer-reviewed journals. However, to reduce the negative effects of publication bias, these kinds of results need to be findable and citable in other research work. One way of achieving this could be to always publish manuscripts on preprint servers like arXiv13.

Scientists should team up!

In general, I would love to see a push towards open science and open data in all scientific fields. Currently, the biggest pioneers in this regard are researchers in artificial intelligence and machine learning, where a lot of the field's success and acceleration can be attributed to the fact that it is the de facto standard to always publish the relevant code and data sets. This way, results can be reproduced easily and fast progress can be made. As a colleague of mine once said:

When there is a minimal standard that can improve the current practice without practically any extra effort - there is no excuse not to do it.
TK 2018

Title image source: Image from Andra Mihali

Janek Gröhl
Data Science, Digital Twins, Deep Learning, Photoacoustic Imaging

Janek Gröhl is a data scientist who conducts research towards quantitative photoacoustic imaging.