Introduction
Healthcare AI has moved from academic prototype to regulated product in the space of a few years. Diagnostic algorithms now fall under the EU AI Act’s high-risk classification, UKCA marking requirements for software as a medical device (SaMD), and NICE evidence standards. The datasets used to train these models are examined just as closely as the models themselves.
This article addresses the practical considerations for teams building diagnostic or prognostic AI models on NHS or real-world health data, from initial dataset selection through to regulatory submission.
Dataset Selection
Cohort definition
The single most important decision in any AI development project is cohort definition. A poorly defined training cohort will produce a model that performs well on paper and fails in clinical deployment. Your inclusion/exclusion criteria should reflect the real-world population in which the model will be deployed, not the ideal population that makes your metrics look best.
Common pitfalls include: selecting cohorts from tertiary centres when the model will deploy in primary care, using historical data that predates a significant change in clinical practice, and selecting a time window that over-represents one season or demographic.
Minimum viable dataset size
There is no universal minimum. The rule of thumb of “at least 10 events per predictor variable” (EPV ≥ 10) comes from the logistic regression literature and is frequently misapplied to neural networks. For deep learning models on imaging data, expect a floor of thousands of labelled examples per class; for structured EHR data, expect tens of thousands of patient records for a binary outcome model.
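As a rough illustration of the EPV heuristic, the sketch below computes events per candidate parameter for a regression-style model and then for a network with a large parameter count. The cohort size, event count, and parameter counts are hypothetical, and `events_per_variable` is an illustrative helper, not part of any library.

```python
def events_per_variable(n_events: int, n_non_events: int, n_parameters: int) -> float:
    """EPV: size of the smaller outcome class divided by candidate parameters."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return min(n_events, n_non_events) / n_parameters


# Hypothetical cohort: 20,000 records, 800 outcome events.
n_records = 20_000
n_events = 800

# Logistic regression with 25 candidate predictor parameters.
epv = events_per_variable(n_events, n_records - n_events, 25)
print(f"Regression EPV = {epv:.1f}")  # 32.0

# The same cohort against a network with 1,000,000 trainable weights
# (an assumed figure) gives a number that says nothing useful.
epv_nn = events_per_variable(n_events, n_records - n_events, 1_000_000)
print(f"Network 'EPV' = {epv_nn:.4f}")  # 0.0008
```

The second figure suggests one reason the heuristic transfers poorly: a network's raw parameter count is not a meaningful stand-in for the number of predictors in a regression model, so the ratio cannot be used to judge whether the dataset is large enough.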
Key point: Plan to validate the model on data from a different institution or time period than the training data. A model that has only been evaluated on data resembling its training set can fail silently in deployment.
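One way to set aside validation data by site and by time period is sketched below. The `Record` fields, the site codes, and the cutoff date are assumptions made for illustration; a real pipeline would work against your own schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    patient_id: str
    site: str
    index_date: date


def split_for_validation(records, holdout_sites, temporal_cutoff):
    """Partition records into (train, external_site, temporal) lists."""
    train, external_site, temporal = [], [], []
    for r in records:
        if r.site in holdout_sites:
            external_site.append(r)  # different institution
        elif r.index_date >= temporal_cutoff:
            temporal.append(r)  # same sites, later period
        else:
            train.append(r)
    # Drop training rows for any patient who also appears in a validation
    # set, so no individual contributes to both sides of the split.
    held_out_ids = {r.patient_id for r in external_site + temporal}
    train = [r for r in train if r.patient_id not in held_out_ids]
    return train, external_site, temporal


# Hypothetical records: site "B" is held out; 2021 onwards is the later period.
records = [
    Record("p1", "A", date(2019, 3, 1)),
    Record("p2", "A", date(2022, 6, 1)),
    Record("p3", "B", date(2018, 1, 1)),
    Record("p1", "A", date(2022, 9, 1)),
    Record("p4", "A", date(2020, 5, 5)),
]
train, external_site, temporal = split_for_validation(
    records, holdout_sites={"B"}, temporal_cutoff=date(2021, 1, 1)
)
print([r.patient_id for r in train])          # ['p4']
print([r.patient_id for r in external_site])  # ['p3']
print([r.patient_id for r in temporal])       # ['p2', 'p1']
```

Patient p1 is removed from training because a later record for the same patient falls in the temporal validation set. Whether to drop such patients from training or from validation is a design choice; this sketch assumes keeping the validation set intact matters more.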
Bias Mitigation
NHS data reflects the health inequalities present in the population it serves. Models trained on historically coded data will encode those inequalities unless explicit steps are taken to identify and mitigate them. These steps include:
- Disaggregating performance metrics by age, sex, ethnicity, deprivation quintile, and geography
- Auditing training labels for systematic coding differences across demographic groups
- Ensuring under-represented groups meet minimum sample size thresholds for subgroup analysis
- Documenting known limitations in the dataset metadata
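The disaggregation and sample-size checks in the list above could be sketched as follows. The subgroup labels, the toy predictions, and the `min_events` threshold are all hypothetical; an appropriate minimum depends on the metric and the precision required, and no specific figure is implied here.

```python
from collections import defaultdict


def subgroup_report(rows, min_events=100):
    """Per-subgroup sensitivity and specificity.

    rows: iterable of (group, y_true, y_pred) with binary 0/1 labels.
    min_events: assumed minimum number of positive cases for a subgroup
                estimate to be treated as reliable (placeholder value).
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for group, y_true, y_pred in rows:
        if y_true:
            key = "tp" if y_pred else "fn"
        else:
            key = "fp" if y_pred else "tn"
        counts[group][key] += 1

    report = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        report[group] = {
            "n": positives + negatives,
            "sensitivity": c["tp"] / positives if positives else None,
            "specificity": c["tn"] / negatives if negatives else None,
            "underpowered": positives < min_events,
        }
    return report


# Toy data keyed by deprivation quintile (Q1, Q5).
rows = [
    ("Q1", 1, 1), ("Q1", 1, 0), ("Q1", 0, 0), ("Q1", 0, 0),
    ("Q5", 1, 1), ("Q5", 0, 1), ("Q5", 0, 0),
]
report = subgroup_report(rows, min_events=2)
for group, stats in sorted(report.items()):
    print(group, stats)
```

In this toy data, Q1 has sensitivity 0.5 and specificity 1.0, while Q5 has sensitivity 1.0 and specificity 0.5 and is flagged as underpowered because it contains only one positive case.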
NICE’s Evidence Standards Framework (ESF) for digital health technologies explicitly requires evidence of equitable performance as part of the evidence generation plan for AI tools seeking NHS adoption.
Validation Frameworks
TRIPOD+AI
TRIPOD+AI is the updated reporting guideline for AI-based clinical prediction model development and validation studies. Adherence to TRIPOD+AI is increasingly expected by journals and, in evidence packages, by NICE and the MHRA. Design your validation plan around TRIPOD+AI requirements from the outset.
PROBAST-AI
PROBAST-AI is the risk-of-bias assessment tool for AI prediction models. Expect reviewers at regulatory bodies to apply a version of this framework when assessing your evidence package. Common high-risk-of-bias findings include: no validation on data from a different site or time period, reporting discrimination without calibration, and analysis conducted at only a single decision threshold.
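Two of the findings above, missing calibration and single-threshold analysis, can be addressed with fairly simple checks. The sketch below is a minimal illustration on toy data, not a full calibration assessment (it does not fit a calibration slope, for example). The function names, toy predictions, and thresholds are assumptions.

```python
def observed_expected_ratio(y_true, y_prob):
    """Calibration-in-the-large: observed events / sum of predicted risks."""
    return sum(y_true) / sum(y_prob)


def calibration_bins(y_true, y_prob, n_bins=5):
    """Equal-width risk bins: (mean predicted risk, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((y, p))
    return [
        (sum(p for _, p in b) / len(b), sum(y for y, _ in b) / len(b), len(b))
        for b in bins
        if b  # skip empty bins
    ]


def threshold_sweep(y_true, y_prob, thresholds):
    """Sensitivity and specificity at each decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both outcome classes are required")
    rows = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        rows.append((t, tp / positives, tn / negatives))
    return rows


# Toy predictions from a hypothetical binary outcome model.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print(f"O/E ratio: {observed_expected_ratio(y_true, y_prob):.2f}")
for mean_pred, observed, n in calibration_bins(y_true, y_prob, n_bins=2):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={n}")
for t, sens, spec in threshold_sweep(y_true, y_prob, [0.25, 0.5, 0.75]):
    print(f"threshold {t:.2f}  sensitivity {sens:.2f}  specificity {spec:.2f}")
```

An observed/expected ratio near 1 indicates that predicted risks match event rates on average. The binned table shows whether that agreement holds across the risk range, and the sweep shows how the sensitivity and specificity trade-off shifts with the operating point.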
Regulatory Pathways
Software that meets the definition of a medical device under the Medical Devices Regulations 2002 (as amended) requires either self-certification (Class I SaMD) or conformity assessment by a UK Approved Body (Class IIa and above). The MHRA’s Software and AI as a Medical Device guidance sets out the criteria.
Diagnostic AI tools that influence clinical decision-making are generally Class IIa or higher, requiring involvement of a UK Approved Body. Budget 12–24 months for the full regulatory pathway from validated model to market.
Conclusion
Building clinically effective, regulatorily viable AI on health data is demanding but achievable. The organisations that succeed are those that treat data quality, bias analysis, and validation rigour not as compliance overhead but as scientific prerequisites.

