Notes: Making Data Science work for Clinical Reporting - Part 2

Clinical trial
Data science
Reporting

This is the Part 2 of a four-part course on Coursera. In this part, agile and DevOps practices are introduced, along with version control with Git and reproducible R projects.

Author

Chi Zhang

Published

February 22, 2023

This is a course provided by Genentech (part of Roche) on Coursera.

Course link

Agile mindset and DevOps practices

Data science as a new way of thinking

New way of working means

  • leverage standards and automation (CI/CD)
  • adopt new data types quickly, reusing data for multiple purposes, pooling data, data marts
  • open-sourcing and collaborating cross pharma (small, readable, self-tested code)
  • coding for reusability, moving away from single-use programs
  • rapidly re-arranginng re-usable components to meet analytical need at hand

Data scientist need to have hard skills, such as

  • SAS, R, Python, JS, bash
  • cloud, containers
  • CI/CD tools
  • visualisation
  • knowledge of various data types

and also soft skills:

  • collaborative and inclusive
  • transparent and practical
  • creative and proactive
  • asking the right questions
  • able to wear many hats, be more flexible and resilient

Agile

Project management; a mindset: uncover better ways of working, by doing and helping others do it.

1st principle: highest priority is to satisfy the customer through early and continuous delivery of valuable software.

Implementations: Kanban, Scrum, Lean, Extreme programming

Tools:

  • backlog
  • kanban board (not started, in progress, done)
  • WIP (work in progress limit)
  • progress measures: e.g. team velocity

DevOps

Increase efficiency by improving the connection between Dev (software development) and Ops (IT operations).

The goal is continuous delivery and continuous improvement.

Practices:

  • modular architecture
  • version control
  • merge into trunk daily
  • automated and continuous testing, continuous integration
  • automated deployment

DevOps in clinical reporting

Risks around production run:

  • are all dependencies in production?
  • was all quality control completed and successful?
  • is all documentation complete?
  • was the transfer to eDMS correct and successful?

Version control

Feature branch (as opposed to master branch): one task per branch

name feature branch: issue number and description

Each issue should have a clear description, short and specific; instead of being long and overarching.

Workflow for clinical reporting

Restraints of clinical deliveries: timing annd multiple deliveries; resourcing challenges

Might need to choose between feature and GitFlow.

Reproducible projects in R

To reproduce your work:

  • Git (version control)
  • R libraries
  • Well structured projects
  • Underlying dependencies (e.g. operating systems, C++/C)

Well structured projects

Clear names

Good documentation

R libraries and versions

Check session info; but not the most practical way.

Use global libraries, .libPaths(), this gives you the path where all the packages are installed. Global libraries is useful when using a server for multiple R sessions, where they look for the packages in the same place.

Solutions

  • renv package: makes each project in R self-contained.
  • Checkpoint: project level library paths based on snapshots of CRAN

Use Docker images! Saves R version, operating system, underlying dependencies