Using Docker and Knitr to Create Reproducible and Extensible Publications
David Mawdsley, Robert Haines and Caroline Jay
C4RR Workshop, Cambridge. 27 June 2017
The problem
- The scholarly publication process is slow and lumpy
- But scientific knowledge is incremental
- And there are always more questions than there’s time to answer
- How can Docker and reproducible research help?
Outline
- Background to reproducible research
- Where Docker fits into this
- Benefits and challenges
Reproducible Research

Reproducible Research in R
- knitr allows us to interleave Markdown or LaTeX with R code
- R session persists throughout document
- Can produce something that looks identical to a “normal” paper
- Paper source needed for reproducibility
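As a rough illustration of the interleaving described above, the sketch below knits a tiny Markdown source held in a string. The chunk label `load`, the variable `rt` and the reaction-time values are invented for this example; only `knit()` and its `text` and `quiet` arguments come from knitr itself.

```r
library(knitr)

# A tiny Markdown source with one R chunk and one inline R expression.
# The data are made up purely for illustration.
src <- c(
  "Reaction times were recorded for five trials.",
  "",
  "```{r load}",
  "rt <- c(310, 295, 402, 350, 318)",
  "```",
  "",
  "The mean reaction time was `r mean(rt)` ms."
)

# knit() evaluates the chunk, then the inline expression, in the same session,
# so `rt` defined in the chunk is still available in the final sentence.
out <- knit(text = src, quiet = TRUE)
cat(out)
```

The final sentence of the knitted text carries the computed mean (335 for these made-up values) rather than a number typed by hand, which is the point of keeping the paper source alongside the paper.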
Reproducible != Reusable
- Reproducibility is a good thing
- It makes you do things properly
- It lets others check your work
- It lets others repeat your work
- It doesn’t (necessarily) make it easy to reuse or extend your work
Docker Pipelines for Reproducible, Reusable Research
- By breaking our analysis pipeline into sections we obtain a more flexible and modular workflow.
- This makes incremental improvement and extension of the work easier
- Docker facilitates this
- Use a Makefile to handle dependencies between “modules”
- The manuscript is part of the analysis pipeline
Example - IDInteraction
- Automate the coding of behaviours
- This is really slow and tedious to do by hand.

Docker images
- Each module contains its own Makefile
- Example: object tracking

Top-level Makefile
- Handles dependencies between the Docker image modules
- Calls the final Docker image to produce manuscript
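The dependency handling that make provides can be sketched in R as a freshness check: a target is rebuilt only if it is missing or older than one of its prerequisites. This is an illustration of the rule, not the project's Makefile; `needs_rebuild()`, the file names and the "tracking" stage are hypothetical.

```r
# TRUE if `target` is missing or older than any of its prerequisites,
# which is the rule make applies before rerunning a recipe.
needs_rebuild <- function(target, prerequisites) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(prerequisites) > file.mtime(target))
}

# Hypothetical stage: object-tracking output feeds a summary file.
tracking <- tempfile(fileext = ".csv")
summary_file <- tempfile(fileext = ".rds")
writeLines(c("frame,x,y", "1,10,20"), tracking)

if (needs_rebuild(summary_file, tracking)) {
  # The slow analysis step would go here; we only copy the data across.
  saveRDS(read.csv(tracking), summary_file)
}
```

Running the last `if` block a second time does nothing, because the summary file is now at least as new as its input. A slow upstream module is therefore not rerun when only the manuscript changes.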
The paper as software
- Treating the paper as “just another part” of the software development process lets us use:
- Version control
- Continuous integration
- Unit testing
Recommendations
- Keep intermediate analysis steps
- Avoids wasted work in slow modules
- Make the Knitr paper runnable on the native system
- Allows interactive writing / analysis
- Test textual assertions with R code:
  “The accuracy doubled when we used the new procedure”
    if (acc.new < 2 * acc.old)
      warning("Accuracy assertion failed")
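A slightly fuller version of this check might wrap the claim in a small function so that the same test can be reused wherever the sentence appears. The accuracy values below are invented for illustration, and `accuracy_doubled()` is a hypothetical helper; in a real manuscript `acc.old` and `acc.new` would come from the analysis itself.

```r
# Hypothetical accuracies standing in for values computed earlier in the paper.
acc.old <- 0.41
acc.new <- 0.86

# TRUE when the sentence "the accuracy doubled" is supported by the numbers.
accuracy_doubled <- function(new, old) {
  new >= 2 * old
}

# Warn at knit time if the prose and the results have drifted apart.
if (!accuracy_doubled(acc.new, acc.old)) {
  warning("Accuracy assertion failed: the text says the accuracy doubled")
}
```

With these values the check passes silently. Raising an error instead of a warning would be one option if a stale claim should be harder to overlook.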
Challenges
- Extra overhead
- Minimised if working reproducibly from the outset
- Manuscript container could be shared
- Good “glue” to streamline workflow is important
- Working offline / collaboratively (e.g. Overleaf, Google Drive)
Challenges are minimised if you work reproducibly from the start; all you then need is the “glue”
Benefits
- Each module (Docker image) can be used independently of the others
- Re-usability and reproducibility
- Can trace each figure in the paper back to its source
- Readers can fully understand methodology
- Paper is self-consistent
- Could reshape publication process
The future of academic research outputs
- Should something like an academic paper be the “standard” research output?
- Is this better than, e.g. a Jupyter notebook? If so, why?
- Audience; who reads it and why?
- Researcher evaluation/assessment; cultural shift, relative (perceived) values of, e.g. notebooks vs journal articles
- It seems likely that academic papers will be the primary means of scientific dissemination / credit for a while
Conclusions
- Knitr makes writing a reproducible paper within R easy
- Complexity comes from leaving the R ecosystem
- Docker reduces this complexity and makes the paper extensible
- Makes the analysis pipeline repurposable
- Extra work minimised if process embedded from the outset