Cohort design defines the question before any modeling begins
Cohort design is where healthcare ML problems become precise. You decide who is eligible, when follow-up starts, what the observation window is, and which clinical states should be excluded because they make the target trivial or ill-posed.
Cohort definitions also need to be shareable. If another analyst or site cannot reproduce the inclusion, exclusion, and exit logic, the model question is underspecified. In observational health-data work, reusable cohort definitions become part of the evidence package because they determine who the model ever had a chance to learn from.
Cohort choices that change model meaning
| Design choice | Why it matters | Common failure |
|---|---|---|
| Denominator population | Controls which patients the model is actually intended for | Training on a convenience sample and deploying to everyone |
| Index time | Defines when the prediction or inference is made | Using data that becomes available only after the decision point |
| Observation window | Determines which history is allowed as model input | Mixing inconsistent lookback periods across sites or settings |
| Outcome horizon | Connects the task to a realistic intervention window | Choosing a horizon that is too late to be actionable |
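The four design choices above can be written down as one shareable specification object so that another analyst or site can reproduce them exactly. A minimal sketch in Python; the field names and example values are illustrative, not an OHDSI schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class CohortSpec:
    """Illustrative cohort definition capturing the four design choices."""
    denominator: str    # who the model is actually intended for
    index_event: str    # what starts the clock (the prediction point)
    lookback_days: int  # observation window allowed as model input
    horizon_days: int   # outcome horizon tied to the intervention window

    def observation_start(self, index_date: date) -> date:
        return index_date - timedelta(days=self.lookback_days)

    def horizon_end(self, index_date: date) -> date:
        return index_date + timedelta(days=self.horizon_days)

spec = CohortSpec(
    denominator="adult inpatients, first qualifying admission",
    index_event="admission to a medical ward",
    lookback_days=365,
    horizon_days=30,
)
print(spec.horizon_end(date(2024, 3, 1)))  # 2024-03-31
```

Freezing the dataclass makes the specification immutable, so the same object can be logged alongside model artifacts as part of the evidence package.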
The Book of OHDSI: Cohorts
OHDSI chapter detailing cohort entry events, inclusion rules, qualifying cohorts, and cohort exit design.
Review the OHDSI cohort chapter
Review of EHR-based phenotyping for research
Peer-reviewed overview of how phenotypes are constructed from electronic health records and why validation remains essential.
Read the phenotyping review
Labels need time-aware definitions, not only SQL extraction
Many health AI failures begin in target construction. A diagnosis code may appear days after the true clinical event, a medication can be both a predictor and a proxy for clinician suspicion, and an outcome may require chart abstraction instead of a single structured field.
Teams should separate the clinical event from the data artifact used to label it. Sepsis onset, readmission, or medication response may be approximated by codes, orders, or note abstractions, but those proxies need justification because documentation lag and billing incentives can distort the ground truth.
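The separation between the clinical event and the data artifact can be made explicit in the labeling function itself. A hedged sketch, assuming a flat event list and a fixed documentation-lag allowance; the code names and the seven-day lag are hypothetical choices, not a validated phenotype:

```python
from datetime import datetime, timedelta

# Hypothetical event records: (patient_id, code, documentation timestamp).
events = [
    ("p1", "sepsis_dx", datetime(2024, 1, 10, 14, 0)),
    ("p2", "sepsis_dx", datetime(2024, 1, 2, 9, 0)),
]

def label_outcome(patient_id, index_time, horizon_days=30, doc_lag_days=7):
    """Positive if the outcome code is documented between the index time and
    the horizon plus an allowance for documentation lag. Codes documented
    before the index indicate prevalent disease, so the label is ill-posed
    and the patient should be excluded rather than labeled negative."""
    window_end = index_time + timedelta(days=horizon_days + doc_lag_days)
    for pid, code, ts in events:
        if pid == patient_id and code == "sepsis_dx":
            if ts < index_time:
                return None  # prevalent at index: exclude from the cohort
            if ts <= window_end:
                return 1
    return 0

print(label_outcome("p1", datetime(2024, 1, 1)))  # 1: documented in window
```

Returning `None` for prevalent cases, rather than `0`, keeps the exclusion decision visible instead of silently converting an ill-posed label into a negative.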
Time-aware label construction
Leakage check
If an analyst cannot explain whether a feature would exist at prediction time in the real workflow, treat it as suspect until proven otherwise.
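A crude but effective version of this check can be automated: drop any feature observation timestamped at or after the prediction time. A minimal sketch, assuming feature rows carry an `observed_at` timestamp (the field and feature names are illustrative):

```python
from datetime import datetime

def prediction_time_features(feature_rows, index_time):
    """Keep only observations available strictly before the prediction
    time; anything timestamped at or after the index is suspect."""
    return [r for r in feature_rows if r["observed_at"] < index_time]

rows = [
    {"name": "lactate", "value": 2.1,
     "observed_at": datetime(2024, 1, 1, 8)},
    {"name": "discharge_summary_flag", "value": 1,
     "observed_at": datetime(2024, 1, 5, 12)},  # exists only after the decision
]

usable = prediction_time_features(rows, datetime(2024, 1, 1, 12))
print([r["name"] for r in usable])  # ['lactate']
```

In practice the `observed_at` field should reflect when the value became available in the workflow (e.g. result time, not order time), which is exactly the question the leakage check asks the analyst to answer.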
Validation splits should follow time and site boundaries
Random row splits can look impressive while hiding the exact failure modes that matter in healthcare: policy changes, new documentation habits, new scanners, formulary shifts, and different patient mixes. Release evidence should therefore include temporal holdouts and, when possible, external-site or service-line validation.
Leakage-resistant validation path
Patient-level separation is non-negotiable
If the same patient’s later admissions, notes, or studies leak into both development and holdout data, reported performance will overstate real-world readiness.
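The two requirements, a temporal boundary and patient-level separation, can be combined in one split rule: assign entire patients to the holdout whenever any of their records crosses the cutoff. A pure-Python sketch with hypothetical field names:

```python
from datetime import date

# Hypothetical admission records: one patient can appear many times.
admissions = [
    {"patient_id": "p1", "admit_date": date(2022, 5, 1)},
    {"patient_id": "p1", "admit_date": date(2023, 8, 1)},  # later admission
    {"patient_id": "p2", "admit_date": date(2023, 2, 1)},
    {"patient_id": "p3", "admit_date": date(2023, 9, 1)},
]

def patient_level_temporal_split(rows, cutoff):
    """Assign whole patients to the holdout if ANY of their admissions
    fall on or after the cutoff, so a patient's earlier admissions never
    remain in the development set alongside holdout data."""
    holdout_ids = {r["patient_id"] for r in rows if r["admit_date"] >= cutoff}
    dev = [r for r in rows if r["patient_id"] not in holdout_ids]
    holdout = [r for r in rows if r["patient_id"] in holdout_ids]
    return dev, holdout

dev, holdout = patient_level_temporal_split(admissions, date(2023, 6, 1))
print(sorted({r["patient_id"] for r in dev}))      # ['p2']
print(sorted({r["patient_id"] for r in holdout}))  # ['p1', 'p3']
```

Note that p1's 2022 admission moves to the holdout along with the 2023 one; a naive date-only split would have left it in development and leaked that patient across the boundary.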
The Book of OHDSI: Patient-level prediction
OHDSI chapter covering patient-level prediction workflows, temporal design, and transportable evaluation patterns.
Review OHDSI patient-level prediction design
Data quality and site shift deserve first-class engineering
Clinical data pipelines need explicit checks for missingness, coding drift, terminology changes, unit normalization, acquisition differences, and delayed documentation. In healthcare, these are not cleanup tasks after modeling. They are part of the model specification.
OHDSI’s data-quality work turns these checks into concrete categories such as conformance, completeness, and plausibility. That framing is useful for ML teams because it distinguishes schema breakage from clinically implausible values and from expected but operationally meaningful missingness.
- Track missingness as a signal and as a data-quality issue; the fact that a lab was not ordered can be meaningful.
- Audit units, coding systems, and site-specific custom fields before training pooled models.
- Validate phenotype definitions on chart review or gold-standard subsets when a label is clinically important.
- Measure drift separately for inputs, prevalence, workflow timing, and downstream outcomes.
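The checks above can be expressed as small rule functions in the spirit of OHDSI's conformance, completeness, and plausibility categories. A sketch with illustrative field names and plausibility bounds (serum sodium is used only as an example; real thresholds belong in a reviewed configuration):

```python
# Illustrative data-quality checks; not the DataQualityDashboard API.
records = [
    {"patient_id": "p1", "sodium_mmol_l": 140.0},
    {"patient_id": "p2", "sodium_mmol_l": None},  # not ordered: a signal, not just a gap
    {"patient_id": "p3", "sodium_mmol_l": 14.0},  # likely a unit or entry error
]

def conformance(rec):
    # Schema-level: the field exists and has the expected type when present.
    value = rec.get("sodium_mmol_l", "missing_field")
    return value is None or isinstance(value, float)

def completeness(recs):
    # Fraction of records with a value; report it, do not silently impute.
    return sum(r["sodium_mmol_l"] is not None for r in recs) / len(recs)

def plausible(rec, low=110.0, high=175.0):
    # Clinically plausible range (illustrative bounds only).
    value = rec["sodium_mmol_l"]
    return value is None or low <= value <= high

report = {
    "conformance": all(conformance(r) for r in records),
    "completeness": completeness(records),
    "implausible": [r["patient_id"] for r in records if not plausible(r)],
}
print(report)
```

Keeping the three categories as separate functions preserves the distinction the OHDSI framing draws: p2's missing value is a completeness finding worth modeling, while p3's value of 14.0 is a plausibility failure that should block training, not feed it.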
The Book of OHDSI: Data quality
OHDSI chapter on conformance, completeness, plausibility, and systematic review of observational datasets.
Review the OHDSI data-quality chapter
OHDSI DataQualityDashboard documentation
Official OHDSI package documentation for executing repeatable OMOP data-quality checks and reviewing results.
Review the DataQualityDashboard docs
FDA Good Machine Learning Practice guiding principles
FDA and partner regulators emphasize representative data management, human factors, and total product lifecycle thinking.
Review GMLP principles
Knowledge Check
Test your understanding with this quiz. You need to answer all questions correctly to mark this section as complete.