Notes: Making Data Science work for Clinical Reporting - Part 3
This is the Part 3 of a four-part course on Coursera. In this part, innerSource and OpenSource concepts are introduced, and R package development is discussed.
This is a course provided by Genentech (part of Roche) on Coursera.
Course link
InnerSource and OpenSource
InnerSource: use of open source software development practices and open source-like culture (even though the software developed might still be proprietary)
When to OpenSource? Which license to use?
- MIT license is the most permissive one
R package development
R packages are useful for
- Reusability by other users, developers, “future me”
- Robustness: well-tested, maintained for a longer period of time
- Encapsulation: hidden complexitty inside the package (internal functions), stable interface exposed to the users (exported functions)
What can ben encapsulated in an R package?
- R functions (internal, exported)
- Tests (unit tests,
testthat
package) - Data (raw, processed)
- Analytical code (reproducible analysis and reporting)
- Text (literate programming): markdown, Rmarkdown, Quarto, Vignettes
- Interactive applications (shiny)
- Web APIs (with
plumber
)
Types of R packages: tool based, methods, analytical, web data project
Principles and tools
Reproducibility: Git (code versioning), dependencies (renv
for r package dependencies, Docker for system dependencies)
Clean code
Code comments: not recommended! Better to write code in a way that does not need additional comments.
DRY: don’t repeat yourself (principle of software development), avoid copy and paste everywhere.
SRP: single-responsibility prinicple, a function should do one thing: either plot a chart, saves a file, changes variables etc, but not all.
Naming conventions
- Reserve dots (.) for S3 methods (
print.patient
) - Reserve CamelCase for R6 classes or package names (
OurPatients
) - Use snake cases (
all_patients
) for function names and arguments, use verb noun pattern (plot_this()
)
Code smells
A function might be too large: break into smaller ones (e.g. could fit in one screen)
A function violates SRP: break into smaller ones, and be explicit in what result it is expected to return
A function with multiple arguments: the scenarios to be tested increase rapidly. Recommended to minimize number of critical function arguments, and break the function into smaller ones.
Bad comments in the code: drop the unnecessary, unclear, outdated comments, write code that are self-explanatory.
Development workflow
Code refactoring: change existing code without its functionality
TDD: Test-Driven Development
- start with writing a new (failing) test
- write code thtat passes the nenw tetst
- refactor the code
- and repeat
Benefits: your code is covered by tests; you think of testing scenarios first; “fail fast” - can immediately repair the code; more freedom to refactor (improve) the code.
How to test
- automatically: CI/CD, after pushing Git commits
- manually:
- run all unit tests in the package (Build / Test package)
- run tests in a selected test file (Run Tests)
- run a single test in Rstudio console
How to check
- R CMD CHECK
Writing robust statistical software
Implement complext statistical methods such that the software is reliable, and includes appropriate testing to ensure high quality and validity and ultimately credibility of statistical analysis results.
- choose the right method and understand them
- solve the core implementation problem with prototype code
Need to try a few different solutions, compare and select the best one. Might also need to involve domain experts.
- spend enough time on planning the design of the R package
Don’t write the package right away; instead define the scope, discuss with users, and design the package.
Start to draw a flow diagram, align names, arguments and classes; write prototype code.
- assume the package will evolve over time
Packages you depend on will change; users will require new features
Write tests
- unit tests
- integration tests
Make the package extensible
- consider object oriented package designs
- combine functions in pipelines
Keep it manageable
- avoid too many arguments
- avoid too large functions
CI/CD for R packages
Continuous Integration: tests code changes to maintain the integrity of the codebase
Continuous Delivery: deploy artifacts (such as an R package) to target systems
Key components
Dependency management
Install dependencies (system/OS level; R packages)
- Set
repos
(can be specified inoptions()
) to e.g. CRAN, BioConductor renv
- container with dependencies pre-installed
Static code analysis
- Linting (for programmatic and syntax errors) via
lintr
package - Code style enforcement via
styler
package - Spell checks identifies misspelled words in vignettes, docs and R code via
spelling
package
Testing
R CMD build
builds R packages as a installable artifactR CMD check
runs 20+ checks including unit tests, reports errors, warnigns and notes- Test coverage reports with
covr
, checks how many lines of code are covered with tests R CMD INSTALL
tests R package installation
Documentation
Auto-generated docs via Roxygen
and pkgdown
Release and deployments
Release artifacts and deployments to target systems
- Changelog (features, bug fixes) in the
NEWS.md
- Release: create the package with
R CMD build
. Validation report withthevalidatoR
- Publishing: CRAN, BioConductor