Cohort design defines the question before any modeling begins
Cohort design is where healthcare ML problems become precise. You decide who is eligible, when follow-up starts, what the observation window is, and which clinical states should be excluded because they make the target trivial or ill-posed.
Cohort definitions also need to be shareable. If another analyst or site cannot reproduce the inclusion, exclusion, and exit logic, the model question is underspecified. In observational health-data work, reusable cohort definitions become part of the evidence package because they determine who the model ever had a chance to learn from.
Cohort choices that change model meaning
| Design choice | Why it matters | Common failure |
|---|---|---|
| Denominator population | Controls which patients the model is actually intended for | Training on a convenience sample and deploying to everyone |
| Index time | Defines when the prediction or inference is made | Using data that becomes available only after the decision point |
| Observation window | Determines which history is allowed as model input | Mixing inconsistent lookback periods across sites or settings |
| Outcome horizon | Connects the task to a realistic intervention window | Choosing a horizon that is too late to be actionable |
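The four design choices above can be written down as one shareable specification object so that another analyst or site can reproduce them exactly. A minimal sketch in Python; the field names and example values are illustrative, not an OHDSI schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class CohortSpec:
    """Illustrative cohort definition capturing the four design choices."""
    denominator: str    # who the model is actually intended for
    index_event: str    # what starts the clock (the prediction point)
    lookback_days: int  # observation window allowed as model input
    horizon_days: int   # outcome horizon tied to the intervention window

    def observation_start(self, index_date: date) -> date:
        return index_date - timedelta(days=self.lookback_days)

    def horizon_end(self, index_date: date) -> date:
        return index_date + timedelta(days=self.horizon_days)

spec = CohortSpec(
    denominator="adult inpatients, first qualifying admission",
    index_event="admission to a medical ward",
    lookback_days=365,
    horizon_days=30,
)
print(spec.horizon_end(date(2024, 3, 1)))  # 2024-03-31
```

Freezing the dataclass makes the specification immutable, so the same object can be logged alongside model artifacts as part of the evidence package.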
The Book of OHDSI: Cohorts
OHDSI chapter detailing cohort entry events, inclusion rules, qualifying cohorts, and cohort exit design.
Review the OHDSI cohort chapter
Review of EHR-based phenotyping for research
Peer-reviewed overview of how phenotypes are constructed from electronic health records and why validation remains essential.
Read the phenotyping review
Labels need time-aware definitions, not only SQL extraction
Many health AI failures begin in target construction. A diagnosis code may appear days after the true clinical event, a medication can be both a predictor and a proxy for clinician suspicion, and an outcome may require chart abstraction instead of a single structured field.
Teams should separate the clinical event from the data artifact used to label it. Sepsis onset, readmission, or medication response may be approximated by codes, orders, or note abstractions, but those proxies need justification because documentation lag and billing incentives can distort the ground truth.
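The separation between the clinical event and the data artifact can be made explicit in the labeling function itself. A hedged sketch, assuming a flat event list and a fixed documentation-lag allowance; the code names and the seven-day lag are hypothetical choices, not a validated phenotype:

```python
from datetime import datetime, timedelta

# Hypothetical event records: (patient_id, code, documentation timestamp).
events = [
    ("p1", "sepsis_dx", datetime(2024, 1, 10, 14, 0)),
    ("p2", "sepsis_dx", datetime(2024, 1, 2, 9, 0)),
]

def label_outcome(patient_id, index_time, horizon_days=30, doc_lag_days=7):
    """Positive if the outcome code is documented between the index time and
    the horizon plus an allowance for documentation lag. Codes documented
    before the index indicate prevalent disease, so the label is ill-posed
    and the patient should be excluded rather than labeled negative."""
    window_end = index_time + timedelta(days=horizon_days + doc_lag_days)
    for pid, code, ts in events:
        if pid == patient_id and code == "sepsis_dx":
            if ts < index_time:
                return None  # prevalent at index: exclude from the cohort
            if ts <= window_end:
                return 1
    return 0

print(label_outcome("p1", datetime(2024, 1, 1)))  # 1: documented in window
```

Returning `None` for prevalent cases, rather than `0`, keeps the exclusion decision visible instead of silently converting an ill-posed label into a negative.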
Time-aware label construction
Leakage check
If an analyst cannot explain whether a feature would exist at prediction time in the real workflow, treat it as suspect until proven otherwise.
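A crude but effective version of this check can be automated: drop any feature observation timestamped at or after the prediction time. A minimal sketch, assuming feature rows carry an `observed_at` timestamp (the field and feature names are illustrative):

```python
from datetime import datetime

def prediction_time_features(feature_rows, index_time):
    """Keep only observations available strictly before the prediction
    time; anything timestamped at or after the index is suspect."""
    return [r for r in feature_rows if r["observed_at"] < index_time]

rows = [
    {"name": "lactate", "value": 2.1,
     "observed_at": datetime(2024, 1, 1, 8)},
    {"name": "discharge_summary_flag", "value": 1,
     "observed_at": datetime(2024, 1, 5, 12)},  # exists only after the decision
]

usable = prediction_time_features(rows, datetime(2024, 1, 1, 12))
print([r["name"] for r in usable])  # ['lactate']
```

In practice the `observed_at` field should reflect when the value became available in the workflow (e.g. result time, not order time), which is exactly the question the leakage check asks the analyst to answer.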
Validation splits should follow time and site boundaries
Random row splits can look impressive while hiding the exact failure modes that matter in healthcare: policy changes, new documentation habits, new scanners, formulary shifts, and different patient mixes. Release evidence should therefore include temporal holdouts and, when possible, external-site or service-line validation.
Leakage-resistant validation path
Patient-level separation is non-negotiable
If the same patient’s later admissions, notes, or studies leak into both development and holdout data, reported performance will overstate real-world readiness.
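The two requirements, a temporal boundary and patient-level separation, can be combined in one split rule: assign entire patients to the holdout whenever any of their records crosses the cutoff. A pure-Python sketch with hypothetical field names:

```python
from datetime import date

# Hypothetical admission records: one patient can appear many times.
admissions = [
    {"patient_id": "p1", "admit_date": date(2022, 5, 1)},
    {"patient_id": "p1", "admit_date": date(2023, 8, 1)},  # later admission
    {"patient_id": "p2", "admit_date": date(2023, 2, 1)},
    {"patient_id": "p3", "admit_date": date(2023, 9, 1)},
]

def patient_level_temporal_split(rows, cutoff):
    """Assign whole patients to the holdout if ANY of their admissions
    fall on or after the cutoff, so a patient's earlier admissions never
    remain in the development set alongside holdout data."""
    holdout_ids = {r["patient_id"] for r in rows if r["admit_date"] >= cutoff}
    dev = [r for r in rows if r["patient_id"] not in holdout_ids]
    holdout = [r for r in rows if r["patient_id"] in holdout_ids]
    return dev, holdout

dev, holdout = patient_level_temporal_split(admissions, date(2023, 6, 1))
print(sorted({r["patient_id"] for r in dev}))      # ['p2']
print(sorted({r["patient_id"] for r in holdout}))  # ['p1', 'p3']
```

Note that p1's 2022 admission moves to the holdout along with the 2023 one; a naive date-only split would have left it in development and leaked that patient across the boundary.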
The Book of OHDSI: Patient-level prediction
OHDSI chapter covering patient-level prediction workflows, temporal design, and transportable evaluation patterns.
Review OHDSI patient-level prediction design
Data quality and site shift deserve first-class engineering
Clinical data pipelines need explicit checks for missingness, coding drift, terminology changes, unit normalization, acquisition differences, and delayed documentation. In healthcare, these are not cleanup tasks after modeling. They are part of the model specification.
OHDSI’s data-quality work turns these checks into concrete categories such as conformance, completeness, and plausibility. That framing is useful for ML teams because it distinguishes schema breakage from clinically implausible values and from expected but operationally meaningful missingness.
- Track missingness as a signal and as a data-quality issue; the fact that a lab was not ordered can be meaningful.
- Audit units, coding systems, and site-specific custom fields before training pooled models.
- Validate phenotype definitions on chart review or gold-standard subsets when a label is clinically important.
- Measure drift separately for inputs, prevalence, workflow timing, and downstream outcomes.
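The checks above can be expressed as small rule functions in the spirit of OHDSI's conformance, completeness, and plausibility categories. A sketch with illustrative field names and plausibility bounds (serum sodium is used only as an example; real thresholds belong in a reviewed configuration):

```python
# Illustrative data-quality checks; not the DataQualityDashboard API.
records = [
    {"patient_id": "p1", "sodium_mmol_l": 140.0},
    {"patient_id": "p2", "sodium_mmol_l": None},  # not ordered: a signal, not just a gap
    {"patient_id": "p3", "sodium_mmol_l": 14.0},  # likely a unit or entry error
]

def conformance(rec):
    # Schema-level: the field exists and has the expected type when present.
    value = rec.get("sodium_mmol_l", "missing_field")
    return value is None or isinstance(value, float)

def completeness(recs):
    # Fraction of records with a value; report it, do not silently impute.
    return sum(r["sodium_mmol_l"] is not None for r in recs) / len(recs)

def plausible(rec, low=110.0, high=175.0):
    # Clinically plausible range (illustrative bounds only).
    value = rec["sodium_mmol_l"]
    return value is None or low <= value <= high

report = {
    "conformance": all(conformance(r) for r in records),
    "completeness": completeness(records),
    "implausible": [r["patient_id"] for r in records if not plausible(r)],
}
print(report)
```

Keeping the three categories as separate functions preserves the distinction the OHDSI framing draws: p2's missing value is a completeness finding worth modeling, while p3's value of 14.0 is a plausibility failure that should block training, not feed it.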
The Book of OHDSI: Data quality
OHDSI chapter on conformance, completeness, plausibility, and systematic review of observational datasets.
Review the OHDSI data-quality chapter
OHDSI DataQualityDashboard documentation
Official OHDSI package documentation for executing repeatable OMOP data-quality checks and reviewing results.
Review the DataQualityDashboard docs
FDA Good Machine Learning Practice guiding principles
FDA and partner regulators emphasize representative data management, human factors, and total product lifecycle thinking.
Review GMLP principles
Knowledge Check
Test your understanding with this quiz. You need to answer all questions correctly to mark this section as complete.