High Availability and Fault Tolerance
For a clinical facility, the worst-case scenario is an outage that forces a transition back to manual, paper-based workflows.
RTO vs RPO: Critical Trade-offs
Recovery Time Objective (RTO) defines maximum acceptable downtime (e.g., 30 minutes). Recovery Point Objective (RPO) defines maximum acceptable data loss (e.g., 5 minutes of transactions). Lower RTO/RPO requires more expensive architectures: Active-Active achieves near-zero RTO/RPO but costs 2-3x more than Pilot Light. Clinical systems typically require RTO < 15 minutes and RPO < 5 minutes for critical patient data.
Building a resilient RIS architecture begins with defining the clinical Recovery Time Objective (RTO)—the maximum acceptable downtime—and the Recovery Point Objective (RPO)—the maximum acceptable data loss. To meet these SLAs, the architecture relies on an active-active deployment spread symmetrically across multiple AWS Availability Zones (AZs). If an entire datacenter becomes unavailable due to a localized disaster (fire, power failure), Elastic Load Balancing automatically routes traffic to the surviving AZs, maintaining continuous availability.
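To make the RTO/RPO definitions concrete, the sketch below (an illustration, not any AWS API; the incident shape and helper name are assumptions) computes the achieved RTO and RPO for a hypothetical incident and checks them against the clinical targets above:

```typescript
// Hypothetical helper: given incident timestamps, compute achieved RTO/RPO
// and compare them to clinical targets (e.g., RTO < 15 min, RPO < 5 min).
interface Incident {
  failureAt: Date;           // when the primary became unavailable
  restoredAt: Date;          // when service was restored to clinicians
  lastReplicatedTxnAt: Date; // newest transaction present on the survivor
}

function assessIncident(i: Incident, rtoTargetMin: number, rpoTargetMin: number) {
  // Achieved RTO: how long clinicians were without the system.
  const rtoMin = (i.restoredAt.getTime() - i.failureAt.getTime()) / 60_000;
  // Achieved RPO: how much transaction history was lost at failure time.
  const rpoMin = (i.failureAt.getTime() - i.lastReplicatedTxnAt.getTime()) / 60_000;
  return { rtoMin, rpoMin, meetsRto: rtoMin <= rtoTargetMin, meetsRpo: rpoMin <= rpoTargetMin };
}

// Example: 12 minutes of downtime, 2 minutes of unreplicated transactions
const result = assessIncident(
  {
    failureAt: new Date('2026-01-01T10:00:00Z'),
    restoredAt: new Date('2026-01-01T10:12:00Z'),
    lastReplicatedTxnAt: new Date('2026-01-01T09:58:00Z'),
  },
  15, // RTO target (minutes)
  5,  // RPO target (minutes)
);
console.log(result); // rtoMin: 12, rpoMin: 2, both targets met
```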
Multi-AZ High Availability Architecture
DR Patterns Comparison: Cost vs Recovery Objectives
| DR Pattern | Description | Typical RTO | Typical RPO | Relative Cost |
|---|---|---|---|---|
| Pilot Light | Minimal footprint (database replication only); scale up on failover | 1-4 hours | < 1 hour | $ |
| Warm Standby | Scaled-down production environment; always running | 15-60 minutes | < 15 minutes | $$ |
| Active-Active | Full production in multiple AZs; load-balanced | < 5 minutes | ~0 | $$$ |
| Multi-Active | Full production in multiple regions; simultaneous operation | < 1 minute | ~0 | $$$$ |
Further reading:
- AWS Healthcare Industry Lens (Well-Architected Framework): AWS best practices for designing resilient healthcare architectures with RTO/RPO guidance.
- AWS Well-Architected Reliability Pillar: best practices for building resilient and reliable systems on AWS.
- AWS Disaster Recovery Strategies: comprehensive guide to DR patterns and implementation strategies.
- Multi-AZ Deployment Best Practices: architecting for high availability across Availability Zones.

Multi-Region Disaster Recovery
For catastrophic geographic failures (e.g., severe earthquakes), a Multi-Region architecture replicates the topology in a geographically distant AWS Region.
This is useful as a generic recovery pattern before you layer RIS-specific data stores onto it. The exact services may differ, but the active-active idea is the same: health-aware routing in front of replicated state and symmetric regional application capacity.
Route 53 TTL Impact on Failover Time
DNS Time-To-Live (TTL) directly affects failover speed. A TTL of 60 seconds means clients may cache the old IP for up to 60 seconds after failover. For clinical systems, set TTL to 60 seconds or lower on critical endpoints. Route 53 Health Checks evaluate endpoints every 10-30 seconds, but DNS propagation adds TTL delay. Test actual failover time with dig commands during DR drills.
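A rough worst-case estimate can be computed directly from these parameters. The sketch below (an illustration; it ignores resolver-specific caching behavior) combines the health-check detection window with one TTL of client caching, using a 10-second check interval, a failure threshold of 3, and a 60-second TTL:

```typescript
// Rough worst-case failover estimate: the health check must fail
// `failureThreshold` consecutive times before Route 53 marks the endpoint
// unhealthy, and clients may then cache the stale record for up to one TTL.
function worstCaseFailoverSeconds(
  requestIntervalSec: number, // health check interval (10 or 30 seconds)
  failureThreshold: number,   // consecutive failures before "unhealthy"
  ttlSec: number              // DNS record TTL
): number {
  return requestIntervalSec * failureThreshold + ttlSec;
}

console.log(worstCaseFailoverSeconds(10, 3, 60)); // 90 seconds
```

Measured failover times from DR drills should be compared against this estimate; large deviations usually point at long-TTL intermediate resolvers.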
Route 53 Health Check Configuration
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Route 53 Failover Routing for Multi-Region DR'
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Port: 443
        Type: HTTPS
        ResourcePath: /health
        FullyQualifiedDomainName: ris-primary.hospital.example.com
        RequestInterval: 10
        FailureThreshold: 3
  PrimaryRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: hospital.example.com.
      Name: ris.hospital.example.com
      Type: A
      SetIdentifier: primary
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      AliasTarget:
        DNSName: primary-alb.us-east-1.elb.amazonaws.com
        HostedZoneId: Z35SXDOTRQ7X7K
  SecondaryRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: hospital.example.com.
      Name: ris.hospital.example.com
      Type: A
      SetIdentifier: secondary
      Failover: SECONDARY
      AliasTarget:
        DNSName: secondary-alb.us-west-2.elb.amazonaws.com
        HostedZoneId: Z1H1FL5HABSF5
```

Aurora Global Database Replication Lag
Aurora Global Database uses dedicated infrastructure for cross-region replication, achieving typical latencies under 1 second for most AWS Regions. This is significantly faster than standard asynchronous replication. The secondary cluster can be promoted to primary in < 30 seconds, making it ideal for clinical systems requiring RTO < 5 minutes.
```bash
# Create the primary Aurora cluster in us-east-1
aws rds create-db-cluster \
  --db-cluster-identifier ris-aurora-primary \
  --engine aurora-mysql \
  --engine-version 8.0.mysql_aurora.3.05.0 \
  --master-username admin \
  --master-user-password SecurePass123! \
  --region us-east-1

# Create the Aurora Global Database from the primary cluster
aws rds create-global-cluster \
  --global-cluster-identifier ris-global-db \
  --source-db-cluster-identifier arn:aws:rds:us-east-1:123456789012:cluster:ris-aurora-primary \
  --region us-east-1

# Add a secondary cluster in us-west-2
aws rds create-db-cluster \
  --db-cluster-identifier ris-aurora-secondary \
  --engine aurora-mysql \
  --global-cluster-identifier ris-global-db \
  --region us-west-2
```

DynamoDB Global Tables Conflict Resolution
DynamoDB Global Tables use last-writer-wins conflict resolution based on timestamp. If two regions write to the same item simultaneously, the write with the latest timestamp wins. Design your schema to minimize cross-region conflicts: use region-specific partition keys for frequently updated items, or implement application-level conflict resolution for critical patient data.
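The schema guidance above can be sketched at the application level; the key format and item shape below are illustrative assumptions, not a DynamoDB API:

```typescript
// Region-scoped partition key: items updated frequently in one region carry
// that region's prefix, so concurrent cross-region writes touch different items
// and never conflict.
function regionScopedKey(region: string, patientId: string): string {
  return `${region}#${patientId}`;
}

interface VersionedItem {
  pk: string;
  status: string;
  updatedAtMs: number; // write timestamp used for last-writer-wins
}

// Application-level last-writer-wins, mirroring the Global Tables behavior
// described above: the write with the latest timestamp survives.
function resolveConflict(a: VersionedItem, b: VersionedItem): VersionedItem {
  return a.updatedAtMs >= b.updatedAtMs ? a : b;
}

// Two regions wrote the same logical item; the newer write wins.
const a: VersionedItem = { pk: 'ORDER#P123', status: 'ORDERED', updatedAtMs: 1000 };
const b: VersionedItem = { pk: 'ORDER#P123', status: 'COMPLETED', updatedAtMs: 2000 };
console.log(resolveConflict(a, b).status); // "COMPLETED"
console.log(regionScopedKey('us-east-1', 'P123')); // "us-east-1#P123"
```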
DynamoDB Global Tables IAM Policy
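The interactive policy example did not survive extraction. Below is a minimal hedged sketch of the kind of IAM permissions involved in managing Global Table replicas; the action list, account ID, and table-name pattern are assumptions, so verify against the current DynamoDB Global Tables documentation before use.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGlobalTableReplicaManagement",
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:CreateTableReplica",
        "dynamodb:DescribeTable",
        "dynamodb:UpdateTable",
        "dynamodb:Scan",
        "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:*:123456789012:table/ris-*"
    }
  ]
}
```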
S3 Cross-Region Replication Time Considerations
S3 CRR is asynchronous: new objects typically replicate within minutes, but large objects or high-throughput scenarios may experience delays. CRR applies only to objects created after replication is enabled; existing objects require S3 Batch Replication. For DICOM archives, enable S3 Replication Time Control (RTC), which backs replication with an SLA (99.99% of new objects within 15 minutes) and exposes replication metrics for monitoring.
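A replication configuration with RTC enabled looks roughly like the following. This is a hedged sketch: the field layout follows my recollection of the S3 PutBucketReplication schema, and the role ARN, bucket names, and prefix are placeholders.

```json
{
  "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
  "Rules": [
    {
      "ID": "dicom-archive-crr",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": { "Prefix": "dicom/" },
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": {
        "Bucket": "arn:aws:s3:::ris-dicom-archive-us-west-2",
        "ReplicationTime": { "Status": "Enabled", "Time": { "Minutes": 15 } },
        "Metrics": { "Status": "Enabled", "EventThreshold": { "Minutes": 15 } }
      }
    }
  ]
}
```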
Multi-Region Disaster Recovery Architecture
DR Architecture Patterns Comparison
Aurora Global Database Failover Flow
Crucially, an untested DR strategy is functionally useless. Healthcare organizations must run DR drills regularly to verify that the automated Route 53 switch-over executes reliably.
Further reading:
- Disaster Recovery of Workloads on AWS: Recovery in the Cloud: architectural strategies such as Pilot Light and Warm Standby, recovery patterns, and implementation trade-offs.
- Route 53 Failover Routing: failover-routing behavior and health-check-driven DNS switching for regional disaster recovery.
- Amazon Aurora Global Database: setup, configuration, and failover procedures.
- DynamoDB Global Tables: multi-region table replication with automatic conflict resolution.
- S3 Cross-Region Replication: configure automatic replication of objects across AWS Regions.
- AWS Multi-Region Fundamentals (AWS Prescriptive Guidance): comprehensive guide to multi-region architecture patterns.

Amazon Application Recovery Controller (ARC)
To facilitate highly rapid recovery and limit the impact of isolated impairments, the architecture leverages the Amazon Application Recovery Controller (ARC). ARC provides advanced mechanisms to rapidly shift clinical user traffic away from degraded infrastructure components before an outright failure occurs.
ARC Zonal Shift vs Zonal Autoshift
Zonal Shift is manual—administrators explicitly shift traffic away from a degraded AZ via API or console. Zonal Autoshift is automatic—AWS monitors AZ health and automatically shifts traffic when impairment is detected, with no customer action required. Zonal Autoshift requires opt-in and works with ALB, NLB, and Route 53. For clinical systems, enable Autoshift for automatic protection, but maintain Zonal Shift capability for planned maintenance.
| ARC Capability | Scope | Primary Function |
|---|---|---|
| Routing Control | Regional | Manually reroutes global traffic from one AWS Region to another. |
| Region Switch | Regional | Customer-initiated action to safely transition workloads to a standby Region. |
| Zonal Shift | Zonal | Shift traffic away from a specific, degraded Availability Zone to healthy AZs. |
| Zonal Autoshift | Zonal | AWS-initiated action that automatically shifts traffic away from an impaired AZ. |
```typescript
// Zonal shift uses the ARC zonal shift API; the shift must name the
// Availability Zone to move traffic away from.
import { ARCZonalShiftClient, StartZonalShiftCommand } from '@aws-sdk/client-arc-zonal-shift';

const client = new ARCZonalShiftClient({ region: 'us-east-1' });

async function initiateZonalShift(
  resourceIdentifier: string,
  awayFrom: string,
  comment: string
) {
  const command = new StartZonalShiftCommand({
    resourceIdentifier,
    awayFrom,         // Availability Zone to shift traffic away from
    comment,
    expiresIn: '60m', // Auto-expire after 60 minutes
  });
  const response = await client.send(command);
  console.log('Zonal shift initiated:', response);
  return response;
}

// Usage: shift traffic away from zone use1-az1 due to elevated latency
await initiateZonalShift(
  'arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/ris-nlb/abc123',
  'use1-az1',
  'Elevated latency and error rates during peak hours'
);
```

ARC Zonal Shift Workflow
For a Multi-AZ RIS deployment, ARC provides zonal shift, allowing network administrators to manually route application traffic completely away from a specific Availability Zone that is exhibiting elevated latency or error rates. For global Multi-Region deployments, ARC provides routing control and Region switch capabilities to safely redirect clinical user traffic from the primary impaired region to the secondary standby region.
Further reading:
- Amazon Application Recovery Controller (ARC): overview of ARC readiness checks, routing controls, zonal shift, and region switch patterns.
- Compare Multi-AZ and Multi-Region Recovery Capabilities in ARC: detailed comparison of ARC traffic management capabilities.
- AWS Well-Architected Reliability Pillar: reliability best practices and implementation guidance.
- AWS Global Infrastructure: current AWS Region and Availability Zone inventory for selecting primary and standby deployment locations.
- S3 Storage Class Lifecycle for Medical Imaging: optimizing storage costs with S3 Intelligent-Tiering and Glacier for DICOM archives.
- Amazon CloudFront for Global Content Delivery: distribution behavior, caching, and edge-delivery patterns for web viewers and report assets.
- AWS HealthImaging for DICOM Storage: purpose-built service for storing and accessing medical imaging data.
- HIPAA Contingency Plan Requirements (45 CFR §164.308(a)(7)): administrative safeguards, including contingency-planning and emergency-mode operation requirements.

Disaster Recovery Runbook
A comprehensive DR runbook ensures consistent, repeatable failover procedures during crisis situations. This runbook should be tested quarterly and updated after every infrastructure change.
HIPAA Contingency Plan Testing Requirements
HIPAA §164.308(a)(7) requires healthcare organizations to test contingency plans regularly. Documentation must include: test date, participants, scenarios tested, results, and corrective actions. Failure to test DR procedures can result in HIPAA violations during audits. Maintain detailed logs of all DR drills for compliance reporting.
Pre-Failover Checklist
- Confirm primary region failure via Route 53 Health Checks and CloudWatch alarms
- Notify incident response team and clinical stakeholders
- Verify secondary region health status (all services operational)
- Confirm Aurora Global Database replication lag is < 1 second
- Verify DynamoDB Global Tables are in sync across regions
- Check S3 CRR status for critical DICOM objects
- Document incident start time and initial impact assessment
- Activate incident communication channel (Slack/PagerDuty)
Failover Execution Steps
- Initiate ARC Routing Control to shift traffic to secondary region
- Verify Route 53 DNS propagation (use dig ris.hospital.example.com @8.8.8.8)
- Promote the Aurora Global Database secondary cluster to primary (if automatic failover did not occur)
- Update application configuration to point to secondary region endpoints
- Verify DynamoDB Global Tables are accepting writes in secondary region
- Enable enhanced monitoring on secondary region resources
- Notify clinical users of failover completion and any expected limitations
- Document all actions taken with timestamps
Post-Failover Validation
- Verify end-to-end patient registration workflow
- Test radiology order creation and status updates
- Confirm DICOM image upload and retrieval from secondary region
- Validate billing transaction processing
- Test user authentication and authorization
- Monitor error rates and latency via CloudWatch dashboards
- Confirm backup jobs are running in secondary region
- Schedule post-incident review within 48 hours
Failback Procedures
- Confirm primary region is fully restored and stable
- Re-establish Aurora Global Database with original primary as source
- Allow DynamoDB Global Tables to resynchronize (monitor replication lag)
- Schedule maintenance window for failback (off-peak hours)
- Initiate ARC Routing Control to shift traffic back to primary region
- Verify all services operational in primary region
- Update documentation with lessons learned
- Conduct post-incident review and update runbook
Communication Templates
SUBJECT: [URGENT] RIS Disaster Recovery Activated - [YYYY-MM-DD HH:MM UTC]
TO: Clinical Leadership, IT Operations, Radiology Department
INCIDENT SUMMARY:
- Incident Type: Primary Region Failure | AZ Degradation | Planned DR Drill
- Detection Time: YYYY-MM-DD HH:MM UTC
- Affected Systems: RIS Application, PACS Integration, Billing Module
- Current Status: Failover in progress | Failover complete | Testing phase
EXPECTED IMPACT:
- RTO Target: < 30 minutes
- Expected Downtime: 0-5 minutes | 5-15 minutes | 15-30 minutes
- Data Loss (RPO): None expected | < 5 minutes | < 15 minutes
NEXT UPDATE: 30 minutes | 1 hour | Upon completion
INCIDENT COMMANDER: Name, Contact
TECHNICAL LEAD: Name, Contact

Further reading:
- AWS Resilience Hub: how Resilience Hub models applications, computes RTO/RPO posture, and generates recovery recommendations.
- AWS Elastic Disaster Recovery: block-level replication and orchestration patterns for low-RPO recovery of application servers.

Failover Testing and Validation
Regular failover testing validates DR procedures and identifies gaps before real incidents. Healthcare organizations must test DR capabilities according to compliance requirements and clinical SLAs.
DR Drill Documentation Requirements
Maintain comprehensive documentation for all DR drills: test objectives, participants, scenarios executed, results, issues discovered, and corrective actions. This documentation serves as evidence for HIPAA compliance audits and internal quality reviews. Store drill reports in a dedicated compliance repository with version control.
DR Testing Types and Frequency
| Test Type | Description | Frequency | Duration | Impact |
|---|---|---|---|---|
| Tabletop Exercise | Walkthrough of DR runbook with stakeholders | Monthly | 2-4 hours | None |
| Technical Test | Validate backup restoration and DNS failover | Quarterly | 4-8 hours | Minimal (non-production) |
| Partial Failover | Failover non-critical services to DR region | Semi-annually | 8-12 hours | Low (scheduled maintenance) |
| Full Failover | Complete production failover to DR region | Annually | 12-24 hours | Moderate (requires clinical coordination) |
AWS Fault Injection Simulator Integration
AWS Fault Injection Simulator (FIS) enables controlled chaos engineering to validate resilience without manual intervention. Create experiments that simulate AZ failures, API throttling, and network latency to proactively identify weaknesses.
Fault Injection Simulator Safety Controls
FIS experiments include built-in safety controls: stop conditions automatically halt experiments if CloudWatch alarms trigger (e.g., error rate > 10%), targets limit scope to specific resources, and actions have defined durations. Always test FIS experiments in non-production first. For clinical systems, schedule FIS experiments during maintenance windows with clinical stakeholder approval.
AWS FIS Experiment Template (AZ Failure Simulation)
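The original template JSON did not survive extraction. A hedged sketch follows, modeled on the common FIS pattern of approximating an AZ failure by stopping all tagged instances in one zone; the role ARN, alarm ARN, tag values, and filter paths are placeholder assumptions to verify against the FIS action reference.

```json
{
  "description": "Approximate an AZ failure by stopping all RIS instances in us-east-1a",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "targets": {
    "instances-in-az": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "App": "ris" },
      "filters": [
        { "path": "Placement.AvailabilityZone", "values": ["us-east-1a"] },
        { "path": "State.Name", "values": ["running"] }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stop-az-instances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT10M" },
      "targets": { "Instances": "instances-in-az" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ris-error-rate" }
  ]
}
```

The stop condition ties the experiment to the error-rate alarm described above, so the drill halts automatically if clinical traffic is affected.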
Test Result Documentation Template
DISASTER RECOVERY DRILL REPORT
=====================================
Drill ID: DR-2026-XXX
Date: YYYY-MM-DD
Time: HH:MM - HH:MM UTC
Type: Tabletop | Technical | Partial Failover | Full Failover
PARTICIPANTS:
- Incident Commander: [Name]
- Technical Lead: [Name]
- Clinical Liaison: [Name]
- IT Operations: [Names]
SCENARIOS TESTED:
1. [Scenario description]
2. [Scenario description]
RESULTS:
☐ RTO Achieved: Target = [X] minutes, Actual = [Y] minutes
☐ RPO Achieved: Target = [X] minutes, Actual = [Y] minutes
☐ Route 53 Failover: Success | Partial | Failed
☐ Aurora Promotion: Success | Partial | Failed
☐ DynamoDB Sync: Success | Partial | Failed
☐ Application Health: Success | Partial | Failed
ISSUES IDENTIFIED:
1. [Issue description with severity]
2. [Issue description with severity]
CORRECTIVE ACTIONS:
1. [Action item, owner, due date]
2. [Action item, owner, due date]
NEXT DRILL DATE: YYYY-MM-DD
APPROVED BY: [Name, Title]

Further reading:
- AWS Fault Injection Simulator: chaos-engineering experiments for validating steady-state assumptions and recovery procedures.
- AWS Resiliency: comprehensive resiliency best practices and implementation guidance.

AWS Backup Centralized Management
AWS Backup provides centralized, automated backup management across AWS services and accounts. For healthcare organizations, AWS Backup simplifies compliance reporting and ensures consistent backup policies across all RIS components.
AWS Backup Cross-Region Copy
Enable cross-region copy in AWS Backup plans to automatically replicate backups to a secondary region. This provides an additional layer of protection beyond service-level replication (like Aurora Global). Cross-region copies incur data transfer costs but ensure backup availability even if the primary region is completely unavailable. Configure copy rules with different retention periods for cost optimization.
AWS Backup Plan Configuration
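The interactive example did not survive extraction. Below is a hedged sketch of a backup plan document in the shape accepted by aws backup create-backup-plan; the vault names, cron schedule, and retention periods are placeholders, not recommendations.

```json
{
  "BackupPlanName": "ris-daily-plan",
  "Rules": [
    {
      "RuleName": "daily-with-dr-copy",
      "TargetBackupVaultName": "ris-primary-vault",
      "ScheduleExpression": "cron(0 5 * * ? *)",
      "StartWindowMinutes": 60,
      "CompletionWindowMinutes": 180,
      "Lifecycle": {
        "MoveToColdStorageAfterDays": 30,
        "DeleteAfterDays": 365
      },
      "CopyActions": [
        {
          "DestinationBackupVaultArn": "arn:aws:backup:us-west-2:123456789012:backup-vault:ris-dr-vault",
          "Lifecycle": { "DeleteAfterDays": 90 }
        }
      ]
    }
  ]
}
```

The CopyActions rule implements the cross-region copy described above, with a shorter retention period in the DR vault for cost control.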
Backup Plan Best Practices
- Tag-based assignment: Use resource tags to automatically include new resources in backup plans
- Lifecycle policies: Move older backups to cold storage (Glacier) for cost optimization
- Cross-region copies: Maintain backup copies in DR region for additional protection
- Encryption: Enable KMS encryption for all backup vaults with customer-managed keys
- Access control: Implement least-privilege IAM policies for backup operations
- Monitoring: Configure CloudWatch alarms for backup job failures
- Compliance reporting: Use AWS Backup Audit Manager for continuous compliance monitoring
Further reading:
- AWS Transit Gateway: network transit hub for connecting VPCs and on-premises networks.

RTO/RPO Calculator and Business Impact Analysis
Calculate RTO and RPO requirements based on clinical workflow impact. This worksheet helps quantify the business impact of downtime and determine appropriate DR investment levels.
RTO Calculation Worksheet
RTO Impact Analysis
| Downtime Duration | Clinical Impact | Revenue Loss (Est.) | Compliance Risk |
|---|---|---|---|
| 0-15 minutes | Minimal - brief delay in order processing | < $1,000 | None |
| 15-60 minutes | Moderate - radiology workflow delays | $1,000-$10,000 | Low |
| 1-4 hours | Significant - manual workflows required | $10,000-$50,000 | Moderate |
| 4-24 hours | Severe - potential patient care impact | $50,000-$250,000 | High |
| > 24 hours | Critical - emergency protocols activated | > $250,000 | Critical |
RPO Calculation Worksheet
RPO Data Loss Tolerance
| Data Type | Update Frequency | Max Acceptable Loss | Replication Strategy |
|---|---|---|---|
| Patient Demographics | Per registration | 0 (synchronous) | RDS Multi-AZ synchronous |
| Radiology Orders | Per order | < 5 minutes | Aurora Global (<1s lag) |
| Report Status | Per update | < 15 minutes | DynamoDB Global Tables |
| DICOM Images | Per study | < 1 hour | S3 CRR + Backup |
| Billing Transactions | Per transaction | 0 (synchronous) | RDS Multi-AZ + Backup |
Cost Estimation for DR Patterns
DR Pattern Cost Comparison (Monthly Estimate)
| Component | Pilot Light | Warm Standby | Active-Active | Multi-Active |
|---|---|---|---|---|
| Database (Aurora) | $200 | $800 | $1,600 | $3,200 |
| Compute (EC2/ECS) | $50 | $400 | $800 | $1,600 |
| Storage (S3 CRR) | $100 | $100 | $200 | $400 |
| Network (Data Transfer) | $50 | $100 | $200 | $400 |
| Route 53 + ARC | $30 | $30 | $50 | $100 |
| Total (Monthly) | $430 | $1,430 | $2,850 | $5,700 |
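The monthly totals in the table can be sanity-checked with a small script; the line items are the illustrative estimates from the table itself, not real AWS pricing:

```typescript
// Sum the illustrative per-component estimates for each DR pattern.
// Order: [database, compute, storage, network, route53AndArc]
const costs: Record<string, number[]> = {
  pilotLight: [200, 50, 100, 50, 30],
  warmStandby: [800, 400, 100, 100, 30],
  activeActive: [1600, 800, 200, 200, 50],
  multiActive: [3200, 1600, 400, 400, 100],
};

const totals = Object.fromEntries(
  Object.entries(costs).map(([pattern, items]) => [
    pattern,
    items.reduce((sum, c) => sum + c, 0),
  ])
);
console.log(totals); // { pilotLight: 430, warmStandby: 1430, activeActive: 2850, multiActive: 5700 }
```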
Business Impact Analysis Template
BUSINESS IMPACT ANALYSIS - RIS APPLICATION
================================================
Assessment Date: YYYY-MM-DD
Assessed By: [Name, Title]
CRITICAL BUSINESS PROCESSES:
1. Patient Registration
- Maximum Tolerable Downtime: [X] hours
- Financial Impact per Hour: $[X]
- Regulatory Impact: [Description]
2. Radiology Order Management
- Maximum Tolerable Downtime: [X] hours
- Financial Impact per Hour: $[X]
- Regulatory Impact: [Description]
3. Image Acquisition & Storage
- Maximum Tolerable Downtime: [X] hours
- Financial Impact per Hour: $[X]
- Regulatory Impact: [Description]
4. Report Generation & Distribution
- Maximum Tolerable Downtime: [X] hours
- Financial Impact per Hour: $[X]
- Regulatory Impact: [Description]
DERIVED REQUIREMENTS:
- RTO: [X] minutes/hours (based on most critical process)
- RPO: [X] minutes/hours (based on data update frequency)
- Recommended DR Pattern: [Pilot Light | Warm Standby | Active-Active | Multi-Active]
- Estimated Monthly DR Cost: $[X]
APPROVALS:
- CIO: [Signature, Date]
- CMIO: [Signature, Date]
- Compliance Officer: [Signature, Date]

Further reading:
- HIPAA Contingency Plan (45 CFR §164.308): contingency-planning obligations for backup, disaster recovery, and emergency-mode operations.

Knowledge Check
Test your understanding with this quiz. You need to answer all questions correctly to mark this section as complete.