Using Docker and Knitr to Create Reproducible and Extensible Publications

David Mawdsley, Robert Haines and Caroline Jay

C4RR Workshop, Cambridge. 27 June 2017

The problem

The scholarly publication process is slow and lumpy
But scientific knowledge is incremental
- And there are always more questions than there’s time to answer
How can Docker and reproducible research help?

Outline

Background to reproducible research
Where Docker fits into this
Benefits and challenges

Reproducible Research

Image: Wikipedia

Reproducible Research in R

knitr allows us to interleave markdown or LaTeX with R code
- R session persists throughout document
Can produce something that looks identical to a “normal” paper
- Paper source needed for reproducibility
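
A minimal sketch of the interleaving described above, assuming the knitr package is installed. The document text, chunk labels and variable names are invented for illustration; the fence string is built with strrep() only to keep the example self-contained. The second chunk reuses a value computed in the first, which is what "R session persists" means in practice.

```r
# Assumes knitr is installed; everything in `rmd` is illustrative content.
library(knitr)

fence <- strrep("`", 3)  # markdown code fence, built programmatically

rmd <- c(
  "# Results",
  "",
  paste0(fence, "{r fit}"),
  "x <- c(1, 2, 3, 4)",
  "m <- mean(x)",
  fence,
  "",
  "The mean of the sample is `r m`.",
  "",
  paste0(fence, "{r reuse}"),
  "m * 2  # `m` is still available: the session persists across chunks",
  fence
)

# knit() runs the chunks in order and returns the rendered markdown as text
out <- knit(text = rmd, quiet = TRUE)
cat(out)
```

The inline expression is replaced by the computed value in the rendered text, so numbers quoted in the paper cannot drift from the analysis.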

Reproducible != Reusable

Reproducibility is a good thing
- It makes you do things properly
- It lets others check your work
- It lets others repeat your work
It doesn’t (necessarily) make it easy to reuse or extend your work

Docker Pipelines for Reproducible, Reusable Research

By breaking our analysis pipeline into sections we obtain a more flexible and modular workflow.
- makes incremental improvement / extension of the work easier
Docker facilitates this
Use a Makefile to handle dependencies between “modules”
The manuscript is part of the analysis pipeline
- just another “module”

Example - IDInteraction

Automate the coding of behaviours
- Coding behaviours by hand is really slow and tedious

Docker images

Each module contains its own Makefile
Example: object tracking

Top-level Makefile

Handles dependencies between the Docker image modules
Calls the final Docker image to produce manuscript

Extensible papers

Modularity using Docker makes it easier to extend papers
Avoids salami slicing
Also allows paper to be built anywhere
Allows the publisher/reviewers to check manuscript code
Version control makes it obvious what’s been added to the paper → lighter-weight peer review

The paper as software

Treating the paper as “just another part” of the software development process lets us use:
- Version control
- Continuous integration
- Unit testing

Recommendations

Keep intermediate analysis steps
- Avoids wasted work in slow modules
Make the knitr paper runnable on the native system
- Allows interactive writing / analysis
Test textual assertions with R code:

“The accuracy doubled when we used the new procedure”

if(acc.new < 2*acc.old)
    warning("Accuracy assertion failed")
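
The same idea can be wrapped in a small helper so that every claim in the text has a matching check. This is a sketch only: `check_claim` is a hypothetical function name and the accuracy values are made up for illustration. It warns rather than stops, so the paper still builds when a claim goes stale.

```r
# Hypothetical helper: warn when a claim made in the text is no longer
# supported by the data, and report whether the claim held.
check_claim <- function(condition, claim) {
  if (!isTRUE(condition)) {
    warning("Textual assertion failed: ", claim, call. = FALSE)
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Illustrative values only; in a real paper these come from the analysis.
acc.old <- 0.40
acc.new <- 0.85

check_claim(acc.new >= 2 * acc.old,
            "The accuracy doubled when we used the new procedure")
```

Placing calls like this in a chunk next to the sentence they support keeps the prose and the evidence for it in one place.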

Challenges

Extra overhead
- Minimised if working reproducibly from the outset
- Manuscript container could be shared
- Good “glue” to streamline workflow is important
Working offline / collaboratively (e.g. Overleaf, Google Drive)

Challenges are minimised if you research reproducibly from the start - all you need is the “glue”

Benefits

Each module (Docker image) can be used independently of the others
Re-usability and reproducibility
Can trace each figure in the paper back to its source
Readers can fully understand methodology
Paper is self-consistent
Could reshape publication process

The future of academic research outputs

Should something like an academic paper be the “standard” research output?
Is this better than, e.g. a Jupyter notebook? If so, why?
- Audience; who reads it and why?
- Researcher evaluation/assessment; cultural shift, relative (perceived) values of, e.g. notebooks vs journal articles
It seems likely that academic papers will be the primary means of scientific dissemination / credit for a while

Conclusions

Knitr makes writing a reproducible paper within R easy
- Complexity comes from leaving the R ecosystem
Docker reduces this complexity and makes the paper extensible
- Makes the analysis pipeline repurposable
Extra work minimised if process embedded from the outset

Bibliography/further reading