Notes: Making Data Science work for Clinical Reporting - Part 3

Clinical trial

Data science

Reporting

This is Part 3 of a four-part course on Coursera. This part introduces InnerSource and open source concepts and discusses R package development.

Author

Chi Zhang

Published

February 27, 2023

This is a course provided by Genentech (part of Roche) on Coursera.

Course link

InnerSource and OpenSource

InnerSource: use of open source software development practices and open source-like culture (even though the software developed might still be proprietary)

When to OpenSource? Which license to use?

The MIT license is one of the most permissive commonly used licenses.

R package development

R packages are useful for

Reusability by other users, developers, “future me”
Robustness: well-tested, maintained for a longer period of time
Encapsulation: complexity is hidden inside the package (internal functions), and a stable interface is exposed to the users (exported functions)
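
As a minimal sketch of the encapsulation idea, assuming roxygen2-style documentation comments: only the function tagged with `@export` becomes part of the public interface, while the helper stays internal. The function names (`summarise_age`, `drop_missing`) are hypothetical and not from the course.

```r
# Internal helper (hypothetical name): not exported, so it can change
# without breaking users of the package.
drop_missing <- function(x) {
  x[!is.na(x)]
}

#' Summarise patient ages
#'
#' @param age Numeric vector of ages, possibly containing NA.
#' @return A named numeric vector with the count and mean of non-missing ages.
#' @export
summarise_age <- function(age) {
  age <- drop_missing(age)
  c(n = length(age), mean = mean(age))
}

age_summary <- summarise_age(c(60, NA, 70))
```

Users only ever call `summarise_age()`, so the helper can be renamed or rewritten later without affecting them.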

What can be encapsulated in an R package?

R functions (internal, exported)
Tests (unit tests, testthat package)
Data (raw, processed)
Analytical code (reproducible analysis and reporting)
Text (literate programming): markdown, Rmarkdown, Quarto, Vignettes
Interactive applications (shiny)
Web APIs (with plumber)

Types of R packages: tool based, methods, analytical, web data project

Principles and tools

Reproducibility: Git (code versioning), dependencies (renv for R package dependencies, Docker for system dependencies)

Clean code

Code comments: not recommended! Better to write code in a way that does not need additional comments.

DRY: don’t repeat yourself (principle of software development), avoid copy and paste everywhere.
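
A small illustration of DRY, with made-up data and a hypothetical helper name (`standardise`): instead of copy-pasting the same formula for every column, the logic lives in one function and is applied where needed.

```r
# Copy-paste version (avoid): the same formula repeated per column.
# df$age_z    <- (df$age - mean(df$age)) / sd(df$age)
# df$weight_z <- (df$weight - mean(df$weight)) / sd(df$weight)

# DRY version: the formula is written once and reused.
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

df <- data.frame(age = c(50, 60, 70), weight = c(60, 80, 100))
df[c("age_z", "weight_z")] <- lapply(df[c("age", "weight")], standardise)
```

If the formula ever needs to change, there is now exactly one place to edit.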

SRP: single-responsibility principle. A function should do one thing: plot a chart, save a file, or change variables, but not all of them.
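
One way to picture SRP is to split "compute" from "save", as in the sketch below. The function names and the toy data are hypothetical.

```r
# Responsibility 1: compute the mean value per treatment arm.
compute_mean_by_arm <- function(data) {
  aggregate(value ~ arm, data = data, FUN = mean)
}

# Responsibility 2: write a table to disk and return the path invisibly.
save_table <- function(table, path) {
  write.csv(table, path, row.names = FALSE)
  invisible(path)
}

trial <- data.frame(
  arm = c("A", "A", "B", "B"),
  value = c(1, 3, 10, 20)
)
result <- compute_mean_by_arm(trial)
out_file <- save_table(result, tempfile(fileext = ".csv"))
```

Because the computation does not touch the file system, it can be unit tested on its own.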

Naming conventions

Reserve dots (.) for S3 methods (print.patient)
Reserve CamelCase for R6 classes or package names (OurPatients)
Use snake_case (all_patients) for function names and arguments, and a verb-noun pattern for function names (plot_this())
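
The conventions above can be sketched with a toy S3 class. The class name `patient` follows the `print.patient` example in the list; the constructor `create_patient()` and its fields are hypothetical.

```r
# snake_case, verb-noun name for an ordinary function
create_patient <- function(id, age) {
  structure(list(id = id, age = age), class = "patient")
}

# The dot is reserved for S3 methods: generic "print", class "patient"
print.patient <- function(x, ...) {
  cat("Patient", x$id, "aged", x$age, "\n")
  invisible(x)
}

print(create_patient("P001", 54))
```

Keeping dots out of ordinary function names avoids a function being mistaken for an S3 method.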

Code smells

A function might be too large: break it into smaller ones (e.g. so that each fits on one screen)

A function violates SRP: break into smaller ones, and be explicit in what result it is expected to return

A function with many arguments: the number of scenarios to test grows rapidly with each argument. It is recommended to minimize the number of critical function arguments and to break the function into smaller ones.

Bad comments in the code: drop unnecessary, unclear, or outdated comments, and write code that is self-explanatory.

Development workflow

Code refactoring: change existing code without changing its functionality

TDD: Test-Driven Development

start with writing a new (failing) test
write code that passes the new test
refactor the code
and repeat

Benefits: your code is covered by tests; you think of testing scenarios first; “fail fast” - can immediately repair the code; more freedom to refactor (improve) the code.
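
The red-green-refactor cycle can be sketched in base R as below. The function `bmi()` and its arguments are hypothetical; in a real package the assertion would normally live in a testthat test (`test_that()` with `expect_equal()`) rather than in `stopifnot()`.

```r
# Step 1: write the test first. It fails because bmi() does not exist yet.
test_bmi <- function() {
  stopifnot(isTRUE(all.equal(bmi(weight_kg = 80, height_m = 2), 20)))
  TRUE
}
first_run <- tryCatch(test_bmi(), error = function(e) FALSE)

# Step 2: write just enough code to make the test pass.
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}
second_run <- test_bmi()

# Step 3: refactor bmi() with the test as a safety net, then repeat
# the cycle with the next failing test.
```

Here `first_run` records the failing ("red") state and `second_run` the passing ("green") state.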

How to test

automatically: CI/CD, after pushing Git commits
manually:
- run all unit tests in the package (Build / Test package)
- run tests in a selected test file (Run Tests)
- run a single test in the RStudio console

How to check

R CMD check

Writing robust statistical software

Implement complex statistical methods such that the software is reliable and includes appropriate testing, to ensure the quality, validity, and ultimately the credibility of statistical analysis results.

choose the right methods and understand them
solve the core implementation problem with prototype code

Need to try a few different solutions, compare and select the best one. Might also need to involve domain experts.

spend enough time on planning the design of the R package

Don’t write the package right away; instead define the scope, discuss with users, and design the package.

Start to draw a flow diagram, align names, arguments and classes; write prototype code.

assume the package will evolve over time

Packages you depend on will change; users will require new features

Write tests

unit tests
integration tests

Make the package extensible

consider object oriented package designs
combine functions in pipelines
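
Both ideas can be combined in a short sketch: an S3 generic lets new classes add their own method without modifying existing code, and small functions chain together with the native pipe (this assumes R 4.1 or later). The names `describe()` and `drop_na()` are hypothetical.

```r
# Generic: new classes can add a describe.<class>() method later.
describe <- function(x, ...) UseMethod("describe")

describe.numeric <- function(x, ...) {
  sprintf("mean = %.1f", mean(x))
}

describe.character <- function(x, ...) {
  sprintf("%d distinct values", length(unique(x)))
}

# Small single-purpose function that slots into a pipeline.
drop_na <- function(x) x[!is.na(x)]

age_summary <- c(50, NA, 70) |> drop_na() |> describe()
arm_summary <- c("A", "B", "A") |> describe()
```

Supporting a new input type then means adding one more method rather than extending an `if`/`else` chain inside an existing function.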

Keep it manageable

avoid too many arguments
avoid too large functions

CI/CD for R packages

Continuous Integration: tests code changes to maintain the integrity of the codebase

Continuous Delivery: deploy artifacts (such as an R package) to target systems

Key components

Dependency management

Install dependencies (system/OS level; R packages)

Set repos (can be specified in options()) to e.g. CRAN, Bioconductor
renv
container with dependencies pre-installed
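
As an example of setting repos through `options()`, the snippet below points the current session at a CRAN mirror. The specific mirror URL is an assumption; a project would typically put this in its `.Rprofile` or let renv manage it.

```r
# Use a CRAN mirror for install.packages() in this session.
# The mirror chosen here is an assumption, not a course recommendation.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Inspect the repositories currently in effect.
current_repos <- getOption("repos")
```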

Static code analysis

Linting (for programmatic and syntax errors) via lintr package
Code style enforcement via styler package
Spell checking (identifies misspelled words in vignettes, docs and R code) via spelling package
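
A rough sketch of how these three tools might be called together from R, for instance in a CI step. The wrapper `run_static_checks()` is hypothetical; it assumes lintr, styler and spelling are installed and that `pkg_dir` points at a package source directory. It is only defined here, not run.

```r
# Hypothetical wrapper around the three static analysis tools.
run_static_checks <- function(pkg_dir = ".") {
  list(
    # Lint all R code in the package.
    lints = lintr::lint_package(pkg_dir),
    # dry = "on" reports styling changes without rewriting any files.
    styling = styler::style_pkg(pkg_dir, dry = "on"),
    # Check spelling in documentation and vignettes.
    spelling = spelling::spell_check_package(pkg_dir)
  )
}
```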

Testing

R CMD build builds the R package as an installable artifact
R CMD check runs 20+ checks including unit tests, and reports errors, warnings and notes
Test coverage reports with covr, checks how many lines of code are covered with tests
R CMD INSTALL tests R package installation

Documentation

Auto-generated docs via roxygen2 and pkgdown

Release and deployments

Release artifacts and deployments to target systems

Changelog (features, bug fixes) in the NEWS.md
Release: create the package with R CMD build. Validation report with thevalidatoR
Publishing: CRAN, Bioconductor