Using Docker and Knitr to Create Reproducible and Extensible Publications

David Mawdsley, Robert Haines and Caroline Jay

C4RR Workshop, Cambridge. 27 June 2017

The problem

  • The scholarly publication process is slow and lumpy
  • But scientific knowledge is incremental
    • And there are always more questions than there’s time to answer
  • How can Docker and reproducible research help?

Outline

  • Background to reproducible research
  • Where Docker fits into this
  • Benefits and challenges

Reproducible Research

Image: Wikipedia

Reproducible Research in R

  • knitr allows us to interleave markdown or LaTeX with R code
    • R session persists throughout document
  • Can produce something that looks identical to a “normal” paper
    • Paper source needed for reproducibility
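
The interleaving can be sketched in a few lines of R, assuming the knitr package is installed. The document text, the variable acc and its values are made up for illustration; a real paper would normally live in its own .Rmd or .Rnw file rather than in a character vector.

```r
library(knitr)

# Build the three-backtick chunk delimiter programmatically so that this
# listing does not contain a literal fence of its own.
fence <- strrep("`", 3)

# A tiny, hypothetical R Markdown document: one code chunk, then prose that
# reports values computed in that chunk through inline expressions.
src <- c(
  paste0(fence, "{r}"),
  "acc <- c(0.40, 0.80)   # illustrative numbers, not real results",
  fence,
  "",
  "The accuracy rose from `r acc[1]` to `r acc[2]`."
)

# knit() evaluates the chunks in order within one R session and returns the
# resulting markdown as a character string.
out <- knit(text = src, quiet = TRUE)
cat(out, sep = "\n")
```

Because the prose pulls its numbers from the session, rerunning the analysis updates the text as well.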

Reproducible != Reusable

  • Reproducibility is a good thing
    • It makes you do things properly
    • It lets others check your work
    • It lets others repeat your work
  • It doesn’t (necessarily) make it easy to reuse or extend your work

Docker Pipelines for Reproducible, Reusable Research

  • By breaking our analysis pipeline into sections we obtain a more flexible and modular workflow.
    • makes incremental improvement / extension of the work easier
  • Docker facilitates this
  • Use a Makefile to handle dependencies between “modules”
  • The manuscript is part of the analysis pipeline
    • just another “module”

Example - IDInteraction

  • Automate the coding of behaviours
  • Coding behaviours by hand is slow and tedious

Docker images

  • Each module contains its own Makefile
  • Example: object tracking

Top-level Makefile

  • Handles dependencies between the Docker image modules
  • Calls the final Docker image to produce manuscript

Extensible papers

  • Modularity using Docker makes it easier to extend papers
  • Avoids salami slicing

  • Also allows paper to be built anywhere
  • Allows the publisher/reviewers to check manuscript code

  • Version control makes it obvious what’s been added to the paper → lighter weight peer review

The paper as software

  • Treating the paper as “just another part” of the software development process lets us use:
    • Version control
    • Continuous integration
    • Unit testing

Recommendations

  • Keep intermediate analysis steps
    • Avoids wasted work in slow modules
  • Make the knitr paper runnable on the native system
    • Allows interactive writing / analysis
  • Test textual assertions with R code:

“The accuracy doubled when we used the new procedure”

# Warn when the paper is built if the claim in the text no longer holds
if (acc.new < 2 * acc.old)
    warning("Accuracy assertion failed")
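
One way to extend this check is to wrap it in a helper and to compute the quoted figure from the same values. This is a minimal sketch in base R; check_claim, ratio and the accuracy figures are hypothetical and not taken from the paper.

```r
# Hypothetical helper: stop the build if a claim made in the prose is false.
check_claim <- function(condition, claim) {
  if (!isTRUE(condition)) {
    stop("Textual assertion failed: ", claim, call. = FALSE)
  }
  invisible(TRUE)
}

# Illustrative values; in a real paper these would come from the analysis.
acc.old <- 0.40
acc.new <- 0.85

# Guard the sentence "The accuracy doubled when we used the new procedure".
check_claim(acc.new >= 2 * acc.old, "accuracy at least doubled")

# Compute the figure quoted in the text instead of typing it by hand.
ratio <- round(acc.new / acc.old, 1)
```

Using stop() makes the build fail outright when a claim breaks, which is a design choice; warning(), as on the slide, lets the build continue and only flags the problem.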

Challenges

  • Extra overhead
    • Minimised if working reproducibly from the outset
    • Manuscript container could be shared
    • Good “glue” to streamline workflow is important
  • Working offline / collaboratively (e.g. Overleaf, Google Drive)

The challenges are minimised if you work reproducibly from the start; all you then need is the “glue”

Benefits

  • Each module (Docker image) can be used independently of the others
  • Re-usability and reproducibility
  • Can trace each figure in the paper back to its source
  • Readers can fully understand methodology
  • Paper is self-consistent
  • Could reshape publication process

The future of academic research outputs

  • Should something like an academic paper be the “standard” research output?
  • Is this better than, e.g. a Jupyter notebook? If so, why?
    • Audience; who reads it and why?
    • Researcher evaluation/assessment; cultural shift, relative (perceived) values of, e.g. notebooks vs journal articles
  • It seems likely that academic papers will be the primary means of scientific dissemination / credit for a while

Conclusions

  • Knitr makes writing a reproducible paper within R easy
    • Complexity comes from leaving R ecosystem
  • Docker reduces this complexity and makes the paper extensible
    • Makes the analysis pipeline repurposable
  • Extra work minimised if process embedded from the outset
