Active Data Science for Improving Clinical Risk Prediction

Donna P. Ankerst, Matthias Neumair

Research output: Contribution to journal › Article › peer-review

1 Scopus citation


Clinical risk prediction models are commonly developed in a post-hoc and passive fashion, capitalizing on convenient data from completed clinical trials or retrospective cohorts. The impact of these models often ends at publication rather than reaching patients. The field of clinical risk prediction is rapidly improving in an increasingly transparent data science era. Based on the collective experience of the Prostate Biopsy Collaborative Group (PBCG) over the past decade, this paper proposes four data science-driven strategies for improving clinical risk prediction to the benefit of clinical practice and research. The first strategy is to actively design prospective data collection, monitoring, analysis and validation of risk tools following the same standards as for clinical trials, in order to elevate the quality of training data. The second is to make risk tools and model formulas available online. User-friendly risk tools bring quantitative information to patients and their clinicians for improved knowledge-based decision-making. As past experience testifies, online tools expedite independent validation, which indicates whether the tools generalize to new populations. The third is to dynamically update and localize risk tools to adapt to changing demographic and clinical landscapes. The fourth is to accommodate systematic missing data patterns across cohorts in order to maximize statistical power in model training, and to accommodate missing information on the end-user side in order to maximize utility for the public.

Original language: English
Pages (from-to): 177-192
Number of pages: 16
Journal: Journal of Data Science
Issue number: 2
State: Published - Apr 2023


  • logistic regression
  • missing data
  • prostate cancer
  • risk calculator

