Notes: Making Data Science work for Clinical Reporting - Part 3

Clinical trial
Data science
Reporting

This is the Part 3 of a four-part course on Coursera. In this part, innerSource and OpenSource concepts are introduced, and R package development is discussed.

Author

Chi Zhang

Published

February 27, 2023

This is a course provided by Genentech (part of Roche) on Coursera.

Course link

InnerSource and OpenSource

InnerSource: use of open source software development practices and open source-like culture (even though the software developed might still be proprietary)

When to OpenSource? Which license to use?

  • MIT license is the most permissive one

R package development

R packages are useful for

  • Reusability by other users, developers, “future me”
  • Robustness: well-tested, maintained for a longer period of time
  • Encapsulation: hidden complexitty inside the package (internal functions), stable interface exposed to the users (exported functions)

What can ben encapsulated in an R package?

  • R functions (internal, exported)
  • Tests (unit tests, testthat package)
  • Data (raw, processed)
  • Analytical code (reproducible analysis and reporting)
  • Text (literate programming): markdown, Rmarkdown, Quarto, Vignettes
  • Interactive applications (shiny)
  • Web APIs (with plumber)

Types of R packages: tool based, methods, analytical, web data project

Principles and tools

Reproducibility: Git (code versioning), dependencies (renv for r package dependencies, Docker for system dependencies)

Clean code

Code comments: not recommended! Better to write code in a way that does not need additional comments.

DRY: don’t repeat yourself (principle of software development), avoid copy and paste everywhere.

SRP: single-responsibility prinicple, a function should do one thing: either plot a chart, saves a file, changes variables etc, but not all.

Naming conventions

  • Reserve dots (.) for S3 methods (print.patient)
  • Reserve CamelCase for R6 classes or package names (OurPatients)
  • Use snake cases (all_patients) for function names and arguments, use verb noun pattern (plot_this())

Code smells

A function might be too large: break into smaller ones (e.g. could fit in one screen)

A function violates SRP: break into smaller ones, and be explicit in what result it is expected to return

A function with multiple arguments: the scenarios to be tested increase rapidly. Recommended to minimize number of critical function arguments, and break the function into smaller ones.

Bad comments in the code: drop the unnecessary, unclear, outdated comments, write code that are self-explanatory.

Development workflow

Code refactoring: change existing code without its functionality

TDD: Test-Driven Development

  • start with writing a new (failing) test
  • write code thtat passes the nenw tetst
  • refactor the code
  • and repeat

Benefits: your code is covered by tests; you think of testing scenarios first; “fail fast” - can immediately repair the code; more freedom to refactor (improve) the code.

How to test

  • automatically: CI/CD, after pushing Git commits
  • manually:
    • run all unit tests in the package (Build / Test package)
    • run tests in a selected test file (Run Tests)
    • run a single test in Rstudio console

How to check

  • R CMD CHECK

Writing robust statistical software

Implement complext statistical methods such that the software is reliable, and includes appropriate testing to ensure high quality and validity and ultimately credibility of statistical analysis results.

  1. choose the right method and understand them
  2. solve the core implementation problem with prototype code

Need to try a few different solutions, compare and select the best one. Might also need to involve domain experts.

  1. spend enough time on planning the design of the R package

Don’t write the package right away; instead define the scope, discuss with users, and design the package.

Start to draw a flow diagram, align names, arguments and classes; write prototype code.

  1. assume the package will evolve over time

Packages you depend on will change; users will require new features

Write tests

  • unit tests
  • integration tests

Make the package extensible

  • consider object oriented package designs
  • combine functions in pipelines

Keep it manageable

  • avoid too many arguments
  • avoid too large functions

CI/CD for R packages

Continuous Integration: tests code changes to maintain the integrity of the codebase

Continuous Delivery: deploy artifacts (such as an R package) to target systems

Key components

Dependency management

Install dependencies (system/OS level; R packages)

  • Set repos (can be specified in options()) to e.g. CRAN, BioConductor
  • renv
  • container with dependencies pre-installed

Static code analysis

  • Linting (for programmatic and syntax errors) via lintr package
  • Code style enforcement via styler package
  • Spell checks identifies misspelled words in vignettes, docs and R code via spelling package

Testing

  • R CMD build builds R packages as a installable artifact
  • R CMD check runs 20+ checks including unit tests, reports errors, warnigns and notes
  • Test coverage reports with covr, checks how many lines of code are covered with tests
  • R CMD INSTALL tests R package installation

Documentation

Auto-generated docs via Roxygen and pkgdown

Release and deployments

Release artifacts and deployments to target systems

  • Changelog (features, bug fixes) in the NEWS.md
  • Release: create the package with R CMD build. Validation report with thevalidatoR
  • Publishing: CRAN, BioConductor