Notes: Making Data Science work for Clinical Reporting - Part 1

Clinical trial
Data science

This is Part 1 of a four-part course on Coursera. It introduces the phases of a clinical trial and the motivation for sharing data, along with some key terms (such as the CDISC standards).


Chi Zhang


February 6, 2023

This is a course provided by Genentech (part of Roche) on Coursera.

Course link

Introduction to clinical trials

Clinical trial: aims to demonstrate that a drug is safe and effective (safety, efficacy)

  • phase 1: 10-20 people (healthy volunteers), focus on safety
  • phase 2: about 100 people, study side effects, determine the best dose
  • phase 3: about 1000 people, demonstrate drug efficacy, build a fuller safety profile (commonly run across multiple regions and ethnicities)

Data is collected from different hospitals, so it is important to ensure that standards are being followed

  • evidence must be submitted to health authorities (FDA; EMA, the European Medicines Agency)
  • health authorities determine whether the drug is approved for the market

Submit the analysis plan in advance

Why share data

  • regulatory requirements
  • scientific community interest
  • company internal research interest
  • marketing materials

Data and results sharing

  • Regulatory requirements (e.g. EMA requires sharing clinical trial results to gain marketing authorization for pharma products; FDA requires sharing data)
  • scientific community (peer review checks accuracy, performs additional analyses, derives new hypotheses)
  • CDISC standards
    • CDASH (clinical data acquisition standards harmonization)
    • SDTM (study data tabulation model)
      • format for ‘raw’ data, define datasets, structures, contents, variable attributes
    • SEND (standard for exchange of nonclinical data)
    • ADaM (analysis data model)
      • data format for data processed for analysis (e.g. converted, imputed, derived)
  • Dictionary
    • e.g. nose congestion, stuffy nose, … need to be standardized
    • MedDRA: standard dictionary for medical conditions, events and procedures
    • WHO drug dictionary (for pharma agents)
  • SAP (statistical analysis plan)
    • based on the study protocol, focuses on statistical methodology, is regulated
  • Programming specification
    • based on the SAP, provides additional details on the datasets and the tables, listings and figures (TLFs) required for statistical analysis. Focuses on programming details. Not regulated
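To make the dictionary and 'raw' vs. analysis-ready ideas above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the lookup table is a toy and not actual MedDRA content, and the field names (`subject_id`, `reported_term`, `standard_term`, `coded_flag`) are hypothetical rather than official SDTM or ADaM variable names.

```python
# Toy lookup table: NOT real MedDRA content, purely illustrative.
TOY_DICTIONARY = {
    "nose congestion": "Nasal congestion",
    "stuffy nose": "Nasal congestion",
    "blocked nose": "Nasal congestion",
}


def standardize_term(verbatim):
    """Map a free-text reported term to a standard term, or None if uncoded."""
    # Normalize case and whitespace before the lookup.
    key = " ".join(verbatim.lower().split())
    return TOY_DICTIONARY.get(key)


def derive_analysis_records(raw_records):
    """Turn 'raw' records into analysis-ready records with derived fields."""
    analysis = []
    for rec in raw_records:
        coded = standardize_term(rec["reported_term"])
        analysis.append({
            "subject_id": rec["subject_id"],
            "reported_term": rec["reported_term"],       # kept as collected
            "standard_term": coded,                      # derived via dictionary
            "coded_flag": "Y" if coded is not None else "N",  # derived flag
        })
    return analysis


# Hypothetical 'raw' data as it might arrive from different sites.
raw = [
    {"subject_id": "S001", "reported_term": "Stuffy  nose"},
    {"subject_id": "S002", "reported_term": "nose congestion"},
    {"subject_id": "S003", "reported_term": "headache"},
]
analysis = derive_analysis_records(raw)
for row in analysis:
    print(row)
```

The raw value is kept alongside the derived one, so the analysis record stays traceable back to what was collected. Terms that the toy dictionary does not cover are flagged rather than guessed.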

Quality assurance

  • Good clinical practice (GCP), issued by ICH
  • purpose: prevent mistakes, reduce inefficiencies/waste in a process, increase reliability/trustworthiness of the product of a process
  • Clinical monitoring: performed by a clinical research associate (CRA) at investigator sites; checks that study protocols are executed as intended and that site processes result in accurate data capture. Focus on trial subjects’ safety. Traditional goal: 100% source data verification
  • Data quality checks (more relevant for data scientists). Checks data for technical conformance, and data plausibility. Focus on data quality. Traditional goal: 100% accurate and format compliant data
  • Code review
  • Double programming

Data access restrictions

  • reasons
    • data collected is very sensitive (health data), need data protection
    • clinical trial data is a key asset and revenue predictor for pharma companies, high confidentiality levels
    • scientific validity: data is ideally double-blinded, so no one should know which treatment a subject receives as long as the data is still being collected
  • pseudonymization: data de-identification
    • use pseudonym (ID), link is recorded to allow re-identification
  • anonymization: limit the risk of re-identification
    • remove variables, remove values, replace more precise values with more general categories, replace personal identifiers with random identifiers
  • FSP (functional service provider), CRO (contract research organization): with outsourcing, personnel require data access at different levels
  • Unblinding
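The difference between pseudonymization and anonymization described above can be sketched as follows. This is a toy illustration only: the record fields, the ID format, and the ten-year age bands are assumptions, and real de-identification involves a formal assessment of re-identification risk that this sketch does not attempt.

```python
import secrets


def pseudonymize(records, id_field="patient_name"):
    """Replace the identifier with a pseudonym and KEEP the link table."""
    link_table = {}  # original identifier -> pseudonym (stored securely in practice)
    out = []
    for rec in records:
        original = rec[id_field]
        if original not in link_table:
            link_table[original] = "P-" + secrets.token_hex(4)
        new = {k: v for k, v in rec.items() if k != id_field}
        new["pseudonym"] = link_table[original]
        out.append(new)
    return out, link_table


def anonymize(records):
    """Drop identifiers, generalize values, and keep NO link back."""
    out = []
    for rec in records:
        decade = (rec["age"] // 10) * 10
        out.append({
            "random_id": secrets.token_hex(4),     # random, not recorded anywhere
            "age_band": f"{decade}-{decade + 9}",  # precise value -> category
            "outcome": rec["outcome"],
        })
    return out


# Hypothetical records; names are made up.
records = [
    {"patient_name": "Alice Example", "age": 34, "outcome": "improved"},
    {"patient_name": "Bob Example", "age": 58, "outcome": "no change"},
    {"patient_name": "Alice Example", "age": 34, "outcome": "improved"},
]
pseudo, link_table = pseudonymize(records)
anon = anonymize(records)
print(pseudo)
print(anon)
```

With the link table, the pseudonymized records can be traced back to a person, which is why access to that table has to be restricted. The anonymized records carry a fresh random identifier per row and a coarser age band, so the same person's rows are no longer linked to each other.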