Problem Statement

R’s great strengths as a facilitator of reproducible analysis are due to packages that provide access to literate programming (e.g. rmarkdown, knitr). To address other aspects of the reproducibility problem, the R community has built a culture of best practice that encourages the use of version control, automation, and software testing. Packages that support these practices have also developed (e.g. devtools, remake, testthat, rhub, githug).

If we consider the facets of reproducibility outlined in the rOpenSci Guide, R is currently weak in the area of Provenance. By that I mean creating provable lineages for data, code, model, and results objects that ensure none can exist in invalid or unexpected states.

This weakness manifests itself in several reproducibility traps one can encounter, particularly when working with teams of analysts. The subject of this proposal is a data, code, and results journalling tool that can be used to identify and avoid these reproducibility pitfalls.

Reproducibility Traps for Teams with the Best of Intentions

Local copies. They are at the crux of the issues described here. Despite the best of intentions, a team using literate programming with version control best practice can still fall afoul of them. In my experience this happens for two main reasons, driven separately by data and code.

Data Copies

Data of moderately large size cannot usually be stored in git repositories. This is because repository size can easily blow out to uneconomical levels when tracking changes of large files. Best practice is to store data inputs for analysis in a remote data store (perhaps with data version control). All members of the team can access the same data file which can act as a ‘single source of truth’ for that data.

Unfortunately, when working with moderately large data files (e.g. > 100 MB), local copies become necessary due to the latency induced by copying data over a network. As soon as a local copy is made, the same latency means any changes or updates to the data are slow to propagate back to the data store.

Code Copies

Local copies of code are the norm when using distributed version control. Even for teams following strict versioning practices, confusion can arise about the code lineage of results due to the existence of working branches where experimental updates are tested before being merged into the master branch.

When it comes time to disseminate the results of the analysis via a literate programming document (Word, PDF, HTML), the document provides no information (verifiable or otherwise) about the version of the code used to produce the analysis. Code inconsistencies are extremely difficult to detect when looking at results in this form. There is also no way to verify that the output was indeed generated by a particular source file, even if that file is sitting beside it in a repository.

Problem Manifestation

Although the examples refer to teams, these problems can just as easily arise in solo work. The problems caused by the necessary use of local copies manifest themselves as an inability to reproduce results, leading to uncertainty and debate about the validity of all aspects of the analysis. For analysis that is non-deterministic, for example involving random samples, the possibility can never be fully discounted that discrepancies are due to sample variation. In fact this possibility creates an ‘easy out’ that has, in my own experience, caused discrepancies due to underlying code and data to go unnoticed for too long.

Proposed Solution

Here I propose a package intended to provide lightweight, flexible support for data, code, and result provenance with minimal change to typical workflows. The design is high level at this stage; no implementation has begun.

Journalling

The proposed package would allow teams to instantiate a journal file object that lives at the root level of a project directory. The journal would have an R API allowing registration and verification of data and results. Some fictional usage is shown below. The verbs used in the API are yet to be decided and are very much up for discussion.

Datasets

For example:

register(source = "//datastore/project/datafile.csv", 
         method = read_csv, 
         name = "data_file", 
         immutable = TRUE)

This would source a file from the data store, tabulating summary statistics for later verification. The record would be stored under the name used in code, in this case data_file. The journal record could be made immutable, forcing analysts to register new names for updated datasets.
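As a rough illustration, the sketch below shows the kind of metadata such a call might capture. The function body, the field names, and the use of the digest package are all assumptions about an as-yet unwritten implementation.

register_sketch <- function(source, method, name, immutable = FALSE) {
  # Sketch only: read the data from the remote store and fingerprint it.
  data <- method(source)
  list(
    name          = name,
    source        = source,
    immutable     = immutable,
    registered_at = Sys.time(),
    checksum      = digest::digest(data),  # cheap fingerprint for verify()
    summary_stats = summary(data)          # tabulated summary statistics
  )
}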

Other things that could be registered in the journal:

  • Data objects representing useful constants
  • Package functions/namespaces (to later verify functions are not masked)
  • S3/S4 R objects.

Later, when a local copy is read in or is about to be analysed, it can be verified against the journal using the stored metadata:

verify(data_file)

You could also perform a more expensive verification of the remote source, journal record, and local copy:

verify_remote(data_file)
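To make the distinction concrete, a minimal sketch of local verification is shown below, assuming the journal can be treated as a named list of records produced at registration. The function name and interface are illustrative only; the proposed verify() would presumably locate the project journal itself.

verify_sketch <- function(object, journal) {
  # Sketch only: compare the in-memory object against its journal record.
  name   <- deparse(substitute(object))   # e.g. "data_file"
  record <- journal[[name]]
  if (is.null(record)) stop("No journal record found for ", name)
  identical(digest::digest(object), record$checksum)
}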

Interaction with Dat and data version control

The Dat protocol could be used as a remote sourcing method for registration. In that case verify() may be able to get away with very little work, instead leveraging the Dat API.

So far I have not imagined that data version control would be a requirement for the journalling workflow, though it would certainly be advisable. The journalling workflow and data version control are complementary: if data version control is in place, it will make resolving issues faster after the journal verification tools have raised the alarm. In this sense the journal can be viewed as a link between the analysis source and data version control.

Data Diffing

The ability to diff data is not a prerequisite for verifying it. Rather, diffing is a desirable diagnostic for investigating validation failures.
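For example, an existing data-diffing tool could be plugged in as that diagnostic. The snippet below uses the daff package purely as an illustration; the choice of package and the object names are assumptions.

# Illustration only: diff a failing local copy against the registered copy.
library(daff)
delta <- diff_data(registered_copy, local_copy)  # both assumed to be data frames
render_diff(delta)                               # human-readable change report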

Code

Code execution journalling could be supported using an expression wrapper:

journal("Modelling A", {
  model_fit <- lm(data = data_file,
     formula = A ~ B + c)
   record(formula = model_fit$call, fit = glance(model_fit), score = RMSE(model_fit))
})

Before execution, the code can be examined for validity using R’s non-standard evaluation. For example, journal() could (a rough sketch of the first check follows the list below):

  • Check that objects registered in the journal are not redefined anywhere.
  • verify() all objects that are registered in the journal, optionally verify_remote() them.
  • Check that there are no uncommitted or unpushed changes in the analysis repository.
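The first of these checks is sketched below, assuming the journal can supply a character vector of registered names. The function name and interface are illustrative only.

find_redefinitions <- function(expr, registered) {
  # Sketch only: walk an unevaluated expression and flag assignments
  # to names that are already registered in the journal.
  hits <- character(0)
  walk <- function(e) {
    if (is.call(e)) {
      op <- as.character(e[[1]])[1]
      if (op %in% c("<-", "=", "<<-")) {
        target <- deparse(e[[2]])
        if (target %in% registered) hits <<- c(hits, target)
      }
      lapply(as.list(e), walk)
    }
  }
  walk(expr)
  unique(hits)
}

find_redefinitions(quote({
  data_file <- read.csv("local_copy.csv")   # would be flagged
  model_fit <- lm(A ~ B + C, data = data_file)
}), registered = "data_file")
#> [1] "data_file"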

During execution, the record() method collects data to store in the journal entry. You could imagine it employing functions from broom and modelr to do so. Other things one might want to record:

  • Plot images.
  • Serialised S3/S4 objects.

Post execution, an entry would be added to the journal containing the information passed to record(), along with useful metadata like the following (an illustrative entry structure is sketched after the list):

  • The username.
  • A session or execution identifier to identify code executed as blocks
    • To allow filtering of related journal entries.
  • The time executed and the time taken to execute.
  • The registered objects that were verified.
  • The repository branch and version.
  • Warnings encountered.
  • The code executed (optionally).
  • The contents of the local environment (optionally).
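Purely as an illustration of shape, a single entry might serialise to something like the list below; every field name and value here is invented for the example.

# Illustrative only: one possible shape for a journal entry.
entry <- list(
  label       = "Modelling A",
  user        = Sys.info()[["user"]],
  session_id  = "3f9c1a",                                      # invented identifier
  executed_at = Sys.time(),
  elapsed_s   = 12.4,                                          # invented timing
  verified    = "data_file",
  repository  = list(branch = "master", commit = "0f3c2e1"),   # invented version info
  warnings    = character(0),
  record      = list(formula = "A ~ B + C", score = 0.42)      # invented record() output
)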

Literate Code

The creation of journaled chunks naturally suggests a special type of knitr chunk for R Markdown files. Relevant journal metadata could be rendered along with the code chunk. Example syntax:

```{r, journal = "Modelling A"}
model_fit <- lm(data = data_file,
                formula = A ~ B + C)
...
```

Thanks to Nick Tierney for the suggestion.

Opportunities for Journal Analysis

With such a rich amount of metadata available, analysis of the journal itself could lead to useful project insights. For example, the trend of model scores over time might suggest when the analysis has reached a point of diminishing returns.
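For instance, if journal entries could be read into a data frame, a score trend might be plotted with something like the code below. read_journal() and the column names are hypothetical.

# Hypothetical: plot journaled model scores over time.
library(ggplot2)
entries <- read_journal(".journal")   # assumed reader returning a data frame
ggplot(entries, aes(x = executed_at, y = score)) +
  geom_line() +
  labs(title = "Model score over the life of the project",
       x = "Execution time", y = "Score (e.g. RMSE)")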

Opportunities for Journal Usage

Records in the journal could be read by other methods to provide reproducibility utility. For example:

  • The session identifier could be printed in the document output, mapping the report (even in hard copy) to a specific set of results, generated by a specific code version, by a specific author, using verified data inputs.
  • Tests that generate alerts or errors if certain journaled items trend in an undesirable direction (e.g. model scores getting worse); a sketch of such a check follows this list.
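A sketch of the second idea, assuming journal entries can be read into a data frame with executed_at and score columns, and that a higher score (e.g. RMSE) is worse:

check_score_trend <- function(entries) {
  # Sketch only: warn if the most recent journaled score is worse than the last.
  scores <- entries$score[order(entries$executed_at)]
  if (length(scores) >= 2 && tail(scores, 1) > tail(scores, 2)[1]) {
    warning("Latest journaled model score is worse than the previous run.")
  }
  invisible(scores)
}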

As a record of analysis

With larger projects that go through many iterations of analysis, the journalling method can provide a record that establishes the provenance of the current analytical approach. For example, a new collaborator or supervisor may make a suggestion which you believe has already been sufficiently explored. If you have been journalling all modelling efforts, you can easily interrogate the journal to verify to what extent the proposed analysis has already been run.

Analysis can be a difficult thing to discuss in words due to the ‘garden of forking paths’ it presents. The ability to recall analysis details from the journal to build a frame of reference for a conversation about new approaches has the potential to save a lot of time and confusion.

Applied to this use-case, the journal acts much like a traditional laboratory notebook.

Feature Breakdown

Looking toward developing this kind of capability, there are many ways to break it down. One idea would be a breakdown in chronological order:

Estimated Timeline

TODO