Reproducibility and information preservation
Measuring reproducibility in computer systems research
Christian Collberg, Todd Proebsting, Gina Moraila, Akash Shankara, Zuoming Shi and Alex M Warren
The reproducibility, or lack thereof, of scientific research has recently become a hot topic. Among real scientists, reproducibility of experiments has always been an important goal, although as experimental materials and the volumes of data they can generate have grown rapidly, it is becoming more and more difficult to achieve. In principle, the "Methods and Materials" section of a scientific journal article should provide the information needed (at least, for another qualified scientist in the same area) to reproduce the results. In practice, this is not always possible, although one must at least pay lip service to it in order to get a paper accepted. In some domains, it is estimated that half or more of reported results are irreproducible; yet once a result is published (especially by a high-profile journal), it can become difficult for others to publish results that contradict apparently established facts.
Among computer scientists, it seems reproducibility has not attained even this haphazard level of observance. Recently, researchers at the University of Arizona undertook a study to evaluate the reproducibility of all papers in several recent conferences and journals. They first categorized this work and discarded "theoretical" and "experimental" papers (no implementation). Of the remaining papers, they attempted to either download a publicly accessible implementation or obtain it by contacting the authors. Where an implementation could be obtained, they attempted to build it. They report the results here; the linked technical report gives an extended anecdote about some of the authors' initial attempt to obtain a copy of the code of a system they wished to study. It clearly illustrates how infuriating it can be to try to do research that builds on a system (or attempts to contradict claims made about it) when the supporting evidence (code) is unavailable and the available papers/reports are incomplete or inconsistent.
Their report starts with the following central assumption:
Reproducibility is a cornerstone of the scientific process: only if my colleagues can reproduce my work should they trust its veracity. In the wet sciences reproducing someone’s experiment can be difficult, often involving expensive laboratory equipment and elaborate processes. In applied Computer Science, however, this should seldom be so: unless esoteric hardware were used, reproducing the work published in a systems conference or journal should be as simple as going to the authors’ website, downloading their code and data, typing “make,” and seeing if the results correspond to the published ones. [Collberg et al., p. 1]
While I definitely agree that the Arizona study is important and worthwhile, this assumption is at best only sort of true, because it assumes that there is a fixed "computational environment" that we all share and that will remain valid indefinitely. We don't and it won't.
First let me say what is good about Collberg et al.'s study. It is valuable and timely. I would say that this is an overdue wake-up call, but if so, we computer scientists have already slept till midafternoon: scientists in other disciplines have been aware of the dangers of computationally-based research that is not backed up by reusable data and code at least since 2004, which is when I started working on provenance and started learning about the motivations for it coming from scientists. One of the major (albeit seldom clearly enunciated) goals of provenance research is to facilitate reproducibility. Of course, provenance is only one piece of this puzzle: it is aimed at helping untangle the history/derivation of data involved in a computation, and does not directly address the complementary problem of preserving the computational resources (code, systems, architectures, etc.) needed to reproduce a computation.
In fact, I would argue that the need for digital preservation (both of data and code) has been clear for a long time. One of my first papers was on an attempt to define information preservation, and this was far from the first paper on digital preservation. At the time, I was working with some colleagues in digital libraries at Cornell, namely Bill Arms, Carl Lagoze, and Peter Botticelli, and our goal was primarily to advocate formal techniques (lightweight mathematical modeling) to researchers interested in understanding, say, the tradeoffs between high-resolution scanning and OCR. Our main contribution was a simple framework consisting of:
- a "physical" space $S$ of artifacts that might carry information
- an "information content" space $C$ of the possible information values of the artifacts
- a mapping $I : S \to C$ from physical artifacts to their information content
- actions $\delta$ that can apply (or be applied by the environment) to the artifacts, which might indirectly affect the information contents
Of course, there might also be multiple different $C$'s and $I$'s for a given artifact space $S$, for example, one that interprets a piece of paper as a Unicode string of characters on it, another that interprets it as a bitmap image, etc. Choosing one information space and interpretation is part of deciding what is important about the information contained in an artifact.
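To make the framework concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than something taken from the paper: the artifact is a plain string, the two interpretations `as_words` and `as_exact_chars` stand in for two different choices of $C$ and $I$, `normalise_whitespace` plays the role of an action $\delta$, and `preserves` encodes one plausible reading of "the action preserves information", namely that $I(\delta(s)) = I(s)$.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")  # "physical" artifacts
C = TypeVar("C")  # information content


def preserves(interpret: Callable[[S], C],
              action: Callable[[S], S],
              artifact: S) -> bool:
    """Assumed criterion: the action preserves information for this
    artifact if the interpreted content is the same before and after."""
    return interpret(action(artifact)) == interpret(artifact)


# Hypothetical artifact: a page of text including its layout whitespace.
page = "Hello,   world\n"


# Two interpretations I : S -> C over the same artifact space.
def as_words(s: str) -> Tuple[str, ...]:
    """Only the sequence of words matters."""
    return tuple(s.split())


def as_exact_chars(s: str) -> str:
    """Every character, including whitespace, matters."""
    return s


# An action delta : S -> S that changes the artifact.
def normalise_whitespace(s: str) -> str:
    return " ".join(s.split())


print(preserves(as_words, normalise_whitespace, page))        # True
print(preserves(as_exact_chars, normalise_whitespace, page))  # False
```

The same action is harmless under one interpretation and lossy under the other, which is the sense in which choosing $C$ and $I$ amounts to deciding what is important about an artifact.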
The extended version of the paper (linked above) included a section not included in the conference paper in which we went beyond this simple scenario to consider objects/artifacts with more structure. For example, the objects could have different types (e.g. different document types/formats) and they could be acted upon by different operations (format translators, Web browsers, etc.). These operations themselves could also be changing over time, and in general they do (e.g. your browser gets updated every few weeks). The point of this was to model the fact that we generally now don't just read information from an artifact in one step (using our eyes and ears etc.) but instead almost all digital information is mediated by some equipment, code, etc. Nowadays one could add different mobile devices, with their different levels of fidelity to the "real" content, to the mix too. This infrastructure is not fixed once and for all but changes gradually over time.
For obvious economic and technological reasons, no one seems to be proposing that we freeze development of new operating systems, computer hardware etc., and so for the foreseeable future, new and improved hardware and software will continue to replace old. However, most code, and almost all research code, is developed without a clear plan for how to maintain it indefinitely so that it will still work in (say) 10 years, or even 1 year. It is important to understand that this is not necessarily a failure of the software developer, but may be unavoidable, or at least it may not be clear to the developer how to avoid this (within the usual constraints of research time/budget/sanity). However, Collberg et al.'s assertion that reproducibility for computer systems research "should be as simple as ... downloading their code and data, typing 'make'" and checking the results is only true under very carefully controlled circumstances. Even the first part, going to the website and downloading something, is far from simple.
How, if at all, can we map the notion of information preservation from our 2001 paper onto recent discussion of reproducibility?
Recall that Collberg et al. define reproducibility as follows: one can obtain the code (either by following a link in the paper or by contacting the authors), build and run it, and compare the results to the published ones, without having to make nontrivial changes to the code. As far as I can tell they did not always go further in evaluating whether the code supports the actual experimental results of the paper, but let's forget about this for the time being and take it as a given that the ability to rebuild/run the code is at least a prerequisite for "true" reproducibility.
Let's say that the artifacts are simply the bit strings comprising the archive containing code to be built, plus some computational description (e.g. the Makefile). (Nothing forces the artifacts to actually be physical objects). The information is just a Boolean value: does it build?
What does it actually mean to rebuild code? Well, today, one might have several different types of machines, but most academic code is likely to build in a relatively vanilla environment, GNU/Linux, BSD/MacOS or Windows (possibly using Cygwin). Because Collberg et al. chose papers from 2-3 years ago, it is a reasonable assumption that the code should still build on current machines. So, the mapping from the artifacts to the information we want just runs the build script / makefile, and checks whether it yields reasonable results. But this is not a single function - it depends on which machine/architecture/OS/library suite etc. is being used. So there is no single notion of reproducibility based on rebuilding - there are as many as there are machine/architecture/OS/library combinations (i.e. a lot.)
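The dependence on the environment can be sketched as a family of interpretations, one per configuration. The following is a toy model under loudly stated assumptions: `Artifact`, `Environment`, the tool and library names (`libfoo-1`, `libfoo-2`) and the subset test in `builds` are all hypothetical, and a real version of `builds` would actually run the build script on the machine in question instead of comparing sets.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Artifact:
    """Hypothetical stand-in for a code archive plus its build description."""
    name: str
    requires: FrozenSet[str]  # tools/libraries the build script expects


@dataclass(frozen=True)
class Environment:
    """Hypothetical machine/architecture/OS/library configuration."""
    label: str
    provides: FrozenSet[str]


def builds(artifact: Artifact, env: Environment) -> bool:
    """Toy interpretation I_env : S -> Bool.

    Assumption: a build succeeds exactly when everything it requires is
    available. A real check would run the build and inspect the result.
    """
    return artifact.requires <= env.provides


archive = Artifact("research-code", frozenset({"make", "gcc", "libfoo-1"}))

environments = [
    Environment("machine A, today", frozenset({"make", "gcc", "libfoo-1"})),
    Environment("machine B, today", frozenset({"make", "clang", "libfoo-1"})),
    Environment("machine A, years later", frozenset({"make", "gcc", "libfoo-2"})),
]

# One artifact, several answers: the result is a function of the environment.
results = {env.label: builds(archive, env) for env in environments}
for label, ok in results.items():
    print(f"{label}: {'builds' if ok else 'does not build'}")
```

The archive itself never changes in this sketch; only the second argument does, which is why "does it build?" cannot be answered from the artifact alone.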
So far, so good; let's just fix a single machine configuration and use it as a benchmark. Of course, it would have been good for the original developers to know this configuration so that they could check that their code worked on it. However, this might not be possible to anticipate indefinitely far into the future. In other words, what is reproducible today on a stock Linux machine (modulo library installation etc.) may not be in 1 year or 10 years.
What about the assumption that rebuilding = reproducibility? Obviously this is questionable. In the domain of automated reasoning a proof written today using theorem prover/checker X may not check using the next version of the prover, but usually this is due to easy-to-fix issues with formats/names of tactics or changes to internal search procedures. In some cases, authors commit to maintaining a working version of the proof (e.g. submissions to the Archive of Formal Proofs) but this requires long-term maintenance. There, the ability to check the proof is the most important thing. For experimental results based on research code, we want more: e.g. it would be nice for "rebuilding" to involve reconstructing the experimental results figures and so on. However, if the experimental results are themselves based on access to special hardware or proprietary software, or even dependent on statistics/visualization packages that are open-source but change over time, this too might not be possible to reproduce in a satisfying way.
Incidentally, another of my first papers was on an XML compression technique; together with the paper I provided an open-source distribution of the code accompanying the paper, which is still available here. I think that doing so increased the impact of my work: the code had to do a number of subtle things that were probably not easy to re-implement from the high-level description in the paper, and the code was used in a number of subsequent papers on XML compression as a benchmark (such as this survey). I have intermittently maintained it, but I have no real clue whether it correctly builds on modern architectures, nor do I have the time/energy to ensure this in the absence of evidence that someone needs it fixed. Fortunately, though, having put the code online and made it open-source means that it's not just my problem: if others want to fix or maintain (or fork) my code they can do so; if they can manage to reproduce an old enough computing environment (e.g. virtualized i386 architecture) to run it in its original habitat then this should be fine too.
Different communities, particularly SIGMOD and OOPSLA, have experimented with formal processes for evaluating artifacts or reproducibility (and archiving the artifacts). Such processes should recognize the benefits of making research software publicly available so that if it becomes useful to others, they can contribute (e.g. take over a code project even if the original authors are no longer willing/able to do so). This should be encouraged and hopefully become the norm for papers whose contribution rests on software development (especially when the paper itself cannot describe all of the subtleties involved in the implementation). However, often there is no major (short-term) benefit to doing this, since such contributions may not be appreciated by others in proportion to the required work. There is currently little stigma attached to failure to make code or data available for published research, though again this may change as funders impose open-access policies on both code and data. Some scientific journals actually retract papers when the results cannot be reproduced, but this never happens in computer science as far as I know; maybe it should.
To summarize, reproducibility is a great goal, and both computer science researchers and other scientists should aspire to it, and there should be more work on how to support development of reproducible software alongside work on provenance for supporting reproducible computations. However, there is no fixed, true definition of reproducibility divorced from transient realities of the computational context: what is reproducible (portable) today may not be tomorrow, in 1 year or 10 or 100. That being the case, I don't believe the cost of indefinitely maintaining research code in the face of arbitrary hardware or software changes is justified, but there is a need for monitoring and clearer guidelines about how best to negotiate this tradeoff in a world that does not make it easy. In the longer term, for software-based research to become reproducible, funders, universities, libraries or journals may need to invest resources in maintaining important research software, and research communities may need to reassess how they assign credit (or disincentivize unprofessional behavior such as that documented in the appendix of Collberg et al.'s paper).