# Scaling Roadmap — BD FHIR National

**Audience:** DGHS infrastructure team, future system architects
**Current phase:** Pilot (Phase 1)

---

## Phase thresholds

| Metric | Phase 1 (Pilot) | Phase 2 (Regional) | Phase 3 (National) |
|--------|-----------------|---------------------|---------------------|
| Vendors | <50 | <500 | >500 |
| Resources/day | <10,000 | <100,000 | >1,000,000 |
| Resources total | <1M | <10M | >10M |
| HAPI replicas | 1 | 3 | 5-10+ |
| Orchestrator | docker-compose | docker-compose | Kubernetes |
| PostgreSQL | Single instance | Primary + replica | Patroni HA cluster |
| Estimated trigger | Now | 6-18 months | 18-36 months |

---

## Phase 1 → Phase 2 changes

### 1. Scale HAPI replicas to 3

No configuration changes required — the architecture was designed for this from day one.

```bash
# On the production Ubuntu server
cd /opt/bd-fhir-national
docker-compose --env-file .env up -d --scale hapi=3
```

**Verify after scaling:**

```bash
# All 3 replicas healthy
docker-compose ps hapi

# nginx is load balancing across all 3
# (check HAPI logs — requests should appear in all replica logs)
docker-compose logs --tail=50 hapi

# pgBouncer pool has sufficient capacity:
# 3 replicas × 5 HikariCP connections = 15 connections
# pgBouncer pool_size=20 — 5 headroom remaining. Acceptable.
```

**pgBouncer adjustment at 3+ replicas:**

At 5 replicas (5 × 5 = 25 connections), the current pgBouncer pool_size=20 becomes a bottleneck. Update docker-compose.yml:

```yaml
# pgbouncer-fhir
environment:
  PGBOUNCER_DEFAULT_POOL_SIZE: "30"  # was 20

# And increase postgres-fhir max_connections in postgresql.conf:
# max_connections = 40  # was 30
```

### 2. Add PostgreSQL streaming replication (read replica)

For read-heavy workloads (FHIR search, bulk export), add a read replica. HAPI supports separate read and write datasource URLs.
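Before the replica service defined next can run `pg_basebackup`, the primary needs a role with `REPLICATION` privilege and a matching `pg_hba.conf` entry. A minimal sketch (the container name `bd-postgres-fhir` and the `pg_hba.conf` line are assumptions; the `replicator` role name and `FHIR_REPLICA_PASSWORD` variable match the compose definition that follows):

```shell
#!/bin/sh
# Build the role-creation SQL; "replicator" matches the replica service
# configuration in this document.
CREATE_ROLE_SQL="CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '${FHIR_REPLICA_PASSWORD}'"
echo "$CREATE_ROLE_SQL"

# On the production host (container name is an assumption, match yours):
#   docker exec bd-postgres-fhir psql -U postgres -d fhirdb -c "$CREATE_ROLE_SQL"
#
# And allow replication connections in the primary's pg_hba.conf, e.g.:
#   host replication replicator samenet scram-sha-256
# (reload with: SELECT pg_reload_conf();)
```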
```yaml
# Add to docker-compose.yml:
postgres-fhir-replica:
  image: postgres:15-alpine
  user: postgres  # the server refuses to start as root, and here we bypass the entrypoint
  environment:
    POSTGRES_DB: fhirdb
    PGUSER: replicator
    PGPASSWORD: ${FHIR_REPLICA_PASSWORD}  # pg_basebackup authenticates via PGPASSWORD
  volumes:
    - postgres-fhir-replica-data:/var/lib/postgresql/data
    - ./postgres/fhir/replica.conf:/etc/postgresql/postgresql.conf:ro
  command: >
    bash -c "
    until pg_basebackup -h postgres-fhir -U replicator -D /var/lib/postgresql/data -P -Xs -R; do sleep 5; done
    && postgres -c config_file=/etc/postgresql/postgresql.conf"
  networks:
    - backend-fhir
```

Add a `HAPI_DATASOURCE_READ_URL` environment variable pointing to the replica, and update `DataSourceConfig.java` to configure a separate read datasource.

### 3. Add Redis for distributed JWKS cache

Currently each HAPI replica maintains an independent in-memory JWKS cache. At 3 replicas, a Keycloak key rotation triggers 3 independent JWKS re-fetches within the same second. This is acceptable. At 10+ replicas, add Redis for a shared JWKS cache to reduce Keycloak load.

```yaml
# Add to docker-compose.yml:
redis:
  image: redis:7-alpine
  networks:
    - frontend
    - backend-fhir
  command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
```

Update `KeycloakJwtInterceptor` to use Spring Cache with a Redis backend for JWKS storage.

---

## Phase 2 → Phase 3 changes

### Move to Kubernetes

At national scale, docker-compose is not the correct orchestrator.
Kubernetes provides:

- Horizontal Pod Autoscaler (scale on CPU/RPS automatically)
- Rolling deployments (zero-downtime IG version upgrades)
- Pod Disruption Budgets (maintain minimum replicas during node maintenance)
- Namespace isolation (separate FHIR, audit, monitoring namespaces)

**Kubernetes equivalents:**

| docker-compose service | Kubernetes resource |
|------------------------|---------------------|
| hapi (--scale N) | Deployment + HPA |
| postgres-fhir | StatefulSet (or external Patroni) |
| postgres-audit | StatefulSet (or external Patroni) |
| pgbouncer-fhir | Deployment (sidecar or standalone) |
| nginx | Ingress (nginx-ingress-controller) |

### Partition HAPI JPA tables

At 5M+ resources in `HFJ_RESOURCE`, evaluate partitioning (see V1 migration comments). Prerequisites before partitioning HAPI JPA tables:

1. HAPI must be stopped during the migration (ALTER TABLE is not online in PostgreSQL 15)
2. Foreign key references to HFJ_RESOURCE from all SPIDX tables must be updated
3. The partition key must be included in all primary keys
4. Hibernate DDL validation must be disabled during migration, then re-enabled

This is a planned maintenance window operation — minimum 4-hour downtime window for a database with 5M resources. At 10,000 resources/day, you have approximately 18 months from initial deployment to plan this migration.

**Trigger:** Run `EXPLAIN ANALYZE` on a representative FHIR search query. When sequential scans on HFJ_RESOURCE appear in the plan despite indexes, partitioning is overdue.

---

## Partition maintenance — monthly cron job

The audit tables are partitioned by month, with partitions pre-created through 2027. **A missing partition causes INSERT to fail with a hard error** — no graceful degradation.
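As a sanity check on what the maintenance job must create, next month's partition name and range bounds can be computed with GNU `date`. A sketch (the `audit_events_YYYY_MM` naming follows the check script later in this section; anchoring at the first of the month avoids the end-of-month `+1 month` skew):

```shell
#!/bin/sh
# Next month's partition name and its RANGE bounds (GNU date).
FIRST_OF_MONTH=$(date +%Y-%m-01)
NEXT_MONTH=$(date -d "$FIRST_OF_MONTH +1 month" +%Y_%m)
RANGE_FROM=$(date -d "$FIRST_OF_MONTH +1 month" +%Y-%m-01)
RANGE_TO=$(date -d "$FIRST_OF_MONTH +2 months" +%Y-%m-01)
echo "audit_events_${NEXT_MONTH}: FOR VALUES FROM ('${RANGE_FROM}') TO ('${RANGE_TO}')"
```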
### Setup (run once on the audit PostgreSQL host)

```bash
# Create a login user for the maintenance function
# (audit_maintainer_login was created by postgres/audit/init.sql)

# Add to crontab on the Ubuntu host (or in a scheduled container):
crontab -e

# Run on the 20th of each month at 00:00 UTC — creates next month's partition.
# Note: a crontab entry must be a single line (no backslash continuation).
0 0 20 * * docker exec bd-postgres-audit psql -U audit_maintainer_login -d auditdb -c "SELECT audit.create_next_month_partitions();" >> /var/log/bd-fhir-partition-maintenance.log 2>&1
```

### Verify partition creation

```bash
# After the cron runs, verify the new partition exists
docker exec bd-postgres-audit psql -U postgres -d auditdb -c "
SELECT c.relname AS partition_name,
       pg_get_expr(c.relpartbound, c.oid) AS partition_range
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit' AND p.relname = 'audit_events'
ORDER BY c.relname DESC
LIMIT 3;
"
# Should show the three most recent monthly partitions
```

### Monitor for missing partitions

Add this check to your monitoring system (Prometheus alerting or cron):

```bash
#!/bin/bash
# check_audit_partitions.sh
# Alert if the next month's partition does not exist by the 25th

# Anchor at the first of the month so "+1 month" cannot skip a month
# when run near month-end (GNU date quirk)
NEXT_MONTH=$(date -d "$(date +%Y-%m-01) +1 month" +%Y_%m)
PARTITION="audit_events_${NEXT_MONTH}"

RESULT=$(docker exec bd-postgres-audit psql -U postgres -d auditdb -tAc "
  SELECT COUNT(*) FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE n.nspname = 'audit' AND c.relname = '${PARTITION}';")

if [ "$RESULT" -eq "0" ]; then
  echo "ALERT: Missing audit partition for next month: ${PARTITION}"
  # Send to your alerting system (PagerDuty, Slack, email)
  exit 1
fi

echo "OK: Partition ${PARTITION} exists"
exit 0
```

---

## Monitoring — key metrics to track

These metrics indicate when scaling actions are needed.
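The earliest scaling signal is usually connection pressure, so the headroom arithmetic from the Phase 1 → Phase 2 section is worth keeping as a one-line check. A sketch (the defaults are the figures quoted earlier in this document; adjust to your tuning):

```shell
#!/bin/sh
# Connection headroom: HAPI replicas × HikariCP pool size vs pgBouncer pool.
REPLICAS=3
HIKARI_PER_REPLICA=5
PGBOUNCER_POOL=20

NEEDED=$((REPLICAS * HIKARI_PER_REPLICA))
HEADROOM=$((PGBOUNCER_POOL - NEEDED))
echo "need ${NEEDED} connections, pool ${PGBOUNCER_POOL}, headroom ${HEADROOM}"
if [ "$HEADROOM" -lt 5 ]; then
  echo "WARN: raise PGBOUNCER_DEFAULT_POOL_SIZE before adding replicas"
fi
```

Re-run this whenever the replica count or HikariCP pool size changes; at 5 replicas it reproduces the 25-vs-20 bottleneck described above.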
### PostgreSQL — fhir

```sql
-- Connection utilisation (should be <80% of max_connections)
-- Count all backends, not only state='active': idle connections
-- still occupy slots against max_connections.
SELECT count(*) AS connections,
       max_conn,
       round(100.0 * count(*) / max_conn, 1) AS utilisation_pct
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn FROM pg_settings WHERE name = 'max_connections') mc
GROUP BY max_conn;

-- Table bloat (trigger VACUUM if dead_tuple_ratio > 10%)
SELECT relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE relname IN ('hfj_resource', 'hfj_spidx_token', 'hfj_res_ver')
ORDER BY dead_pct DESC;

-- Index usage (trigger REINDEX if idx_scan is 0 for a non-new index)
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname LIKE 'hfj_%'
ORDER BY idx_scan ASC
LIMIT 10;
```

### PostgreSQL — audit

```sql
-- Partition sizes (plan next archive when any partition exceeds 10GB)
SELECT c.relname AS partition,
       pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit' AND p.relname = 'audit_events'
ORDER BY c.relname DESC;

-- Rejection rate by vendor (flag vendors with >10% rejection rate)
SELECT client_id,
       COUNT(*) AS total_events,
       SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) AS rejections,
       ROUND(100.0 * SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) / COUNT(*), 1) AS rejection_pct
FROM audit.audit_events
WHERE event_time > NOW() - INTERVAL '7 days'
  AND event_type IN ('OPERATION', 'VALIDATION_FAILURE')
GROUP BY client_id
ORDER BY rejection_pct DESC;
```

### HAPI — Prometheus metrics

Key metrics exposed at `/actuator/prometheus`:

| Metric | Alert threshold |
|--------|-----------------|
| `hikaricp_connections_pending` | >0 for >30s → pool exhaustion |
| `hikaricp_connection_timeout_total` | Any increment → pool exhaustion |
| `http_server_requests_seconds_max` | >30s → OCL timeout or slow validation |
| `jvm_memory_used_bytes / jvm_memory_max_bytes` | >85% → OOM risk, increase container memory |
| `process_uptime_seconds` | Resets → unexpected container restart |

---

## IG upgrade procedure

When BD Core IG advances from v0.2.1 to v0.3.0:

```bash
# 1. On CI machine: place new package.tgz in src/main/resources/packages/
cp bd.gov.dghs.core-0.3.0.tgz hapi-overlay/src/main/resources/packages/

# 2. Remove old package (one IG version per image)
rm hapi-overlay/src/main/resources/packages/bd.gov.dghs.core-0.2.1.tgz

# 3. Update application.yaml / docker-compose env vars:
#    HAPI_IG_PACKAGE_CLASSPATH=classpath:packages/bd.gov.dghs.core-0.3.0.tgz
#    HAPI_IG_VERSION=0.3.0

# 4. Build and push new image
docker build \
  --build-arg IG_PACKAGE=bd.gov.dghs.core-0.3.0.tgz \
  --build-arg BUILD_VERSION=1.1.0 \
  --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
  -t your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0 \
  -f hapi-overlay/Dockerfile .
docker push your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0

# 5. Update HAPI_IMAGE in .env on production server

# 6. Rolling redeploy
docker-compose --env-file .env pull hapi
docker-compose --env-file .env up -d --no-deps hapi

# 7. Verify new IG version is active
curl -s https://fhir.dghs.gov.bd/fhir/metadata | jq '.software.version'
# Expected: "0.3.0" or the configured HAPI_IG_VERSION value
```

**Vendor notification:** IG upgrades that change SHALL constraints require vendor notification at least 30 days in advance. Vendors must test against the staging environment before production deployment.
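The verification in step 7 can be turned into a hard gate in a deploy script. A sketch, with the `curl` stubbed out so the gate logic can be exercised offline (substitute the live call; in practice `EXPECTED` would come from `HAPI_IG_VERSION`):

```shell
#!/bin/sh
# Fail the rollout if the server does not report the expected version.
EXPECTED="0.3.0"

# Live check (uncomment on a machine that can reach the server):
#   REPORTED=$(curl -s https://fhir.dghs.gov.bd/fhir/metadata | jq -r '.software.version')
REPORTED="0.3.0"  # stub standing in for the live curl above

if [ "$REPORTED" = "$EXPECTED" ]; then
  echo "OK: version ${REPORTED} active"
else
  echo "ALERT: expected ${EXPECTED}, got ${REPORTED}; roll back"
  exit 1
fi
```

Wiring this into the redeploy step means a half-applied upgrade (old image still serving, or a mis-set `HAPI_IG_VERSION`) stops the pipeline instead of silently reaching vendors.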