Notes: Making Data Science work for Clinical Reporting - Part 2
This is Part 2 of a four-part course on Coursera. This part introduces Agile and DevOps practices, version control with Git, and reproducible R projects.
The course is provided by Genentech (part of Roche).
Course link
Agile mindset and DevOps practices
Data science as a new way of thinking
A new way of working means:
- leverage standards and automation (CI/CD)
- adopt new data types quickly, reusing data for multiple purposes, pooling data, data marts
- open-sourcing and collaborating across pharma (small, readable, self-tested code)
- coding for reusability, moving away from single-use programs
- rapidly rearranging reusable components to meet the analytical need at hand
Data scientists need hard skills, such as:
- SAS, R, Python, JS, bash
- cloud, containers
- CI/CD tools
- visualisation
- knowledge of various data types
and also soft skills:
- collaborative and inclusive
- transparent and practical
- creative and proactive
- asking the right questions
- able to wear many hats, be more flexible and resilient
Agile
Agile is both a project management approach and a mindset: uncovering better ways of working by doing it and helping others do it.
First principle: the highest priority is to satisfy the customer through early and continuous delivery of valuable software.
Implementations: Kanban, Scrum, Lean, Extreme Programming
Tools:
- backlog
- kanban board (not started, in progress, done)
- WIP limit (a cap on the amount of work in progress)
- progress measures: e.g. team velocity
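As a rough illustration of the last two tools, the sketch below checks a WIP limit and computes team velocity in R. The board contents, the limit of 2 and the story-point numbers are all invented for the example. Velocity is taken here as the mean number of story points completed per sprint, which is one common convention and not something these notes prescribe.

```r
# Hypothetical kanban board: one row per task (all values are made up).
board <- data.frame(
  task   = c("demographics table", "safety listing", "QC review", "update spec"),
  status = c("in progress", "in progress", "not started", "done"),
  stringsAsFactors = FALSE
)

# WIP limit: cap on the number of tasks allowed in the "in progress" column.
wip_limit <- 2
wip_count <- sum(board$status == "in progress")
can_start_new_task <- wip_count < wip_limit  # FALSE here: the column is full

# Team velocity: average story points completed per sprint (assumed definition).
points_per_sprint <- c(21, 18, 24)
velocity <- mean(points_per_sprint)
```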
DevOps
Increase efficiency by improving the connection between Dev (software development) and Ops (IT operations).
The goal is continuous delivery and continuous improvement.
Practices:
- modular architecture
- version control
- merge into trunk daily
- automated and continuous testing, continuous integration
- automated deployment
DevOps in clinical reporting
Risks around a production run:
- are all dependencies in production?
- was all quality control completed and successful?
- is all documentation complete?
- was the transfer to eDMS correct and successful?
Version control
Use feature branches (as opposed to working directly on the master branch): one task per branch.
Name each feature branch after its issue number and a short description.
Each issue should have a clear description that is short and specific rather than long and overarching.
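One possible way to apply the branch naming rule is a small R helper that turns an issue number and description into a branch name. The function name `feature_branch_name`, the `<issue>-<description>` pattern and the example issue are assumptions made for illustration; the course does not fix an exact format.

```r
# Build a feature-branch name from an issue number and a short description.
# The "<issue>-<slug>" pattern is an assumed convention, not one set by the course.
feature_branch_name <- function(issue, description) {
  slug <- tolower(description)
  slug <- gsub("[^a-z0-9]+", "-", slug)  # collapse anything non-alphanumeric into "-"
  slug <- gsub("^-+|-+$", "", slug)      # trim leading and trailing "-"
  paste0(issue, "-", slug)
}

# Hypothetical issue number and description.
branch <- feature_branch_name(42, "Fix AE table footnote")
# branch is "42-fix-ae-table-footnote"

# To create the branch from R (needs git on the PATH, run inside a repository):
# system2("git", c("checkout", "-b", branch))
```

Keeping the issue number at the front makes it easy to trace a branch back to the one task it addresses.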
Workflow for clinical reporting
Constraints of clinical deliveries: timing and multiple deliveries; resourcing challenges.
You might need to choose between a feature-branch workflow and GitFlow.
Reproducible projects in R
To reproduce your work:
- Git (version control)
- R libraries
- Well-structured projects
- Underlying dependencies (e.g. operating systems, C++/C)
Well-structured projects
Clear names
Good documentation
R libraries and versions
Check the session info with sessionInfo(). This works, but it is not the most practical way to keep track of versions.
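A minimal sketch of inspecting session info, using base R only. The output file name `session-info.txt` is a hypothetical choice, and it is written to a temporary directory here so the example has no side effects on a real project.

```r
# Capture the R version and the versions of loaded and attached packages.
info <- sessionInfo()
r_version <- info$R.version$version.string

# Version of a single installed package (stats ships with base R).
stats_version <- as.character(packageVersion("stats"))

# Save the full report next to the analysis outputs.
# The file name is hypothetical; tempdir() keeps the example self-contained.
out_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(print(info)), out_file)
```

A saved report documents what was used, but it does not reinstall anything, which is why it is of limited practical use on its own.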
Use global libraries: .libPaths() gives you the paths where packages are installed. Global libraries are useful when a server hosts multiple R sessions, because all sessions look for packages in the same place.
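The sketch below shows how library paths can be inspected and extended with base R. The `shared-r-library` directory is a hypothetical stand-in for a server-wide location and is created under `tempdir()` only so the example runs anywhere.

```r
# Library paths searched by this R session, in order.
paths <- .libPaths()
print(paths)

# Where a given package is actually found (stats ships with base R).
stats_location <- find.package("stats")

# Prepend a shared library for this session only.
# The directory is a hypothetical placeholder for a server-wide location.
shared_lib <- file.path(tempdir(), "shared-r-library")
dir.create(shared_lib, showWarnings = FALSE)
.libPaths(c(shared_lib, paths))
```

Packages are looked up in the order the paths are listed, so the prepended directory takes precedence over the existing ones.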
Solutions
- renv package: makes each R project self-contained.
- checkpoint: project-level library paths based on snapshots of CRAN.
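A hedged sketch of a typical renv workflow, assuming the renv package is installed. The calls are wrapped in two helper functions (`start_tracking` and `rebuild_library`, names invented here) so that nothing changes in a project until they are called deliberately from the project root.

```r
# install.packages("renv")  # once per machine

# Hypothetical helper: start tracking package versions for the current project.
start_tracking <- function() {
  renv::init()      # create a project-local library and an renv.lock file
  # ... install or update packages as the project evolves ...
  renv::snapshot()  # record the exact package versions in renv.lock
}

# Hypothetical helper: rebuild the library on another machine or at a later date.
rebuild_library <- function() {
  renv::restore()   # reinstall the versions recorded in renv.lock
}
```

Committing renv.lock to Git alongside the code lets collaborators restore the same package versions.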
Use Docker images! An image captures the R version, the operating system, and the underlying dependencies.