bd-fhir-national/ops/scaling-roadmap.md
2026-03-16 00:02:58 +06:00


Scaling Roadmap — BD FHIR National

Audience: DGHS infrastructure team, future system architects
Current phase: Pilot (Phase 1)


Phase thresholds

| Metric | Phase 1 (Pilot) | Phase 2 (Regional) | Phase 3 (National) |
| --- | --- | --- | --- |
| Vendors | <50 | <500 | >500 |
| Resources/day | <10,000 | <100,000 | >1,000,000 |
| Resources total | <1M | <10M | >10M |
| HAPI replicas | 1 | 3 | 5-10+ |
| Orchestrator | docker-compose | docker-compose | Kubernetes |
| PostgreSQL | Single instance | Primary + replica | Patroni HA cluster |
| Estimated trigger | Now | 6-18 months | 18-36 months |

Phase 1 → Phase 2 changes

1. Scale HAPI replicas to 3

No configuration changes required — the architecture was designed for this from day one.

# On the production Ubuntu server
cd /opt/bd-fhir-national
docker-compose --env-file .env up -d --scale hapi=3

Verify after scaling:

# All 3 replicas healthy
docker-compose ps hapi

# nginx is load balancing across all 3
# (check HAPI logs — requests should appear in all replica logs)
docker-compose logs --tail=50 hapi

# pgBouncer pool has sufficient capacity
# 3 replicas × 5 HikariCP connections = 15 connections
# pgBouncer pool_size=20 — 5 headroom remaining. Acceptable.

pgBouncer adjustment at 3+ replicas: At 5 replicas (5 × 5 = 25 connections), the current pgBouncer pool_size=20 becomes a bottleneck. Update docker-compose.yml:

# pgbouncer-fhir environment:
PGBOUNCER_DEFAULT_POOL_SIZE: "30"   # was 20
# And increase postgres-fhir max_connections in postgresql.conf:
# max_connections = 40              # was 30
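The sizing arithmetic above can be sanity-checked before scaling. This is a minimal shell sketch using the replica and pool numbers from this document (the values are assumptions to adjust per deployment):

```shell
# Headroom check: pgBouncer pool_size must cover replicas x HikariCP pool size.
REPLICAS=5            # target HAPI replica count
HIKARI_POOL=5         # HikariCP connections per replica (per this document)
PGBOUNCER_POOL=30     # PGBOUNCER_DEFAULT_POOL_SIZE after the adjustment above

NEEDED=$((REPLICAS * HIKARI_POOL))
HEADROOM=$((PGBOUNCER_POOL - NEEDED))
echo "needed=${NEEDED} pool=${PGBOUNCER_POOL} headroom=${HEADROOM}"
if [ "$HEADROOM" -lt 5 ]; then
  echo "WARN: less than 5 connections of headroom; raise PGBOUNCER_DEFAULT_POOL_SIZE"
fi
```

Run it with the target replica count before each scale-up, and keep max_connections on postgres-fhir above the pgBouncer pool size.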

2. Add PostgreSQL streaming replication (read replica)

For read-heavy workloads (FHIR search, bulk export), add a read replica. HAPI supports separate read and write datasource URLs.

# Add to docker-compose.yml:
postgres-fhir-replica:
  image: postgres:15-alpine
  environment:
    POSTGRES_DB: fhirdb
    PGUSER: replicator
    POSTGRES_PASSWORD: ${FHIR_REPLICA_PASSWORD}
    # pg_basebackup authenticates via PGPASSWORD; POSTGRES_PASSWORD alone is not read by it
    PGPASSWORD: ${FHIR_REPLICA_PASSWORD}
  volumes:
    - postgres-fhir-replica-data:/var/lib/postgresql/data
    - ./postgres/fhir/replica.conf:/etc/postgresql/postgresql.conf:ro
  # Bootstrap from the primary only on first start: re-running pg_basebackup against
  # a non-empty data directory fails, which would loop forever on container restart.
  command: >
    bash -c "
    if [ ! -s /var/lib/postgresql/data/PG_VERSION ]; then
      until pg_basebackup -h postgres-fhir -U replicator -D /var/lib/postgresql/data -P -Xs -R; do
        sleep 5;
      done;
    fi;
    postgres -c config_file=/etc/postgresql/postgresql.conf"
  networks:
    - backend-fhir

Add HAPI_DATASOURCE_READ_URL environment variable pointing to the replica, and update DataSourceConfig.java to configure a separate read datasource.
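Before routing reads to the replica, confirm replication is streaming on the primary. The helper below is a sketch: the docker exec command in the comment assumes the primary container exposes psql, and the sample input stands in for real pg_stat_replication output:

```shell
# Fail if any replication connection on the primary is not in "streaming" state.
# Feed it the output of, for example:
#   docker exec <postgres-fhir container> psql -tA -d fhirdb \
#     -c "SELECT state FROM pg_stat_replication;"
check_replication() {
  awk 'NF && $1 != "streaming" { print "ALERT: replica state is " $1; bad=1 }
       END { if (!bad) print "OK: all replicas streaming"; exit bad }'
}

# Example with sample input:
printf 'streaming\n' | check_replication
```

The non-zero exit code on failure makes this usable as a gate in a deploy script.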

3. Add Redis for distributed JWKS cache

Currently each HAPI replica maintains an independent in-memory JWKS cache. At 3 replicas, a Keycloak key rotation triggers 3 independent JWKS re-fetches within the same second. This is acceptable. At 10+ replicas, add Redis for a shared JWKS cache to reduce Keycloak load.

# Add to docker-compose.yml:
redis:
  image: redis:7-alpine
  networks:
    - frontend
    - backend-fhir
  command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}

Update KeycloakJwtInterceptor to use Spring Cache with Redis backend for JWKS storage.


Phase 2 → Phase 3 changes

Move to Kubernetes

At national scale, docker-compose is not the correct orchestrator. Kubernetes provides:

  • Horizontal Pod Autoscaler (scale on CPU/RPS automatically)
  • Rolling deployments (zero-downtime IG version upgrades)
  • Pod Disruption Budgets (maintain minimum replicas during node maintenance)
  • Namespace isolation (separate FHIR, audit, monitoring namespaces)

Kubernetes equivalents:

| docker-compose service | Kubernetes resource |
| --- | --- |
| hapi (--scale N) | Deployment + HPA |
| postgres-fhir | StatefulSet (or external Patroni) |
| postgres-audit | StatefulSet (or external Patroni) |
| pgbouncer-fhir | Deployment (sidecar or standalone) |
| nginx | Ingress (nginx-ingress-controller) |
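As an illustration of the first row, the hapi service maps to a Deployment plus an HPA along these lines (a sketch only; the names and thresholds are assumptions, not an agreed manifest):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hapi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hapi
  minReplicas: 5        # Phase 3 floor from the thresholds table
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Remember that each added replica consumes another HikariCP pool's worth of pgBouncer connections, so maxReplicas must stay consistent with the pool sizing rules from Phase 2.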

Partition HAPI JPA tables

At 5M+ resources in HFJ_RESOURCE, evaluate partitioning (see V1 migration comments).

Prerequisites before partitioning HAPI JPA tables:

  1. HAPI must be stopped during the migration (ALTER TABLE is not online in PostgreSQL 15)
  2. Foreign key references to HFJ_RESOURCE from all SPIDX tables must be updated
  3. The partition key must be included in all primary keys
  4. Hibernate DDL validation must be disabled during migration, then re-enabled

This is a planned maintenance window operation — minimum 4-hour downtime window for a database with 5M resources. At 10,000 resources/day, reaching 5M resources takes about 500 days, so you have roughly 16 months from initial deployment to plan this migration.

Trigger: Run EXPLAIN ANALYZE on a representative FHIR search query. When sequential scans on HFJ_RESOURCE appear in the plan despite indexes, partitioning is overdue.
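This trigger can be scripted. The filter below is a sketch: in production, pipe real EXPLAIN ANALYZE output into check_plan (for example via docker exec into the fhir PostgreSQL container); the sample plan fragment here is illustrative only:

```shell
# Flag sequential scans on HFJ_RESOURCE in an EXPLAIN ANALYZE plan.
check_plan() {
  if grep -qi "seq scan on hfj_resource"; then
    echo "WARN: sequential scan on hfj_resource despite indexes; partitioning is overdue"
  else
    echo "OK: no sequential scan on hfj_resource"
  fi
}

# Example with a sample plan fragment:
printf 'Limit (cost=0.43..8.45)\n  ->  Seq Scan on hfj_resource\n' | check_plan
```

Running this monthly against a representative search query turns the "overdue" judgment into a concrete, reviewable signal.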


Partition maintenance — monthly cron job

The audit tables are partitioned by month with partitions pre-created through 2027. A missing partition causes INSERT to fail with a hard error — no graceful degradation.

Setup (run once on the audit PostgreSQL host)

# Create a login user for the maintenance function
# (audit_maintainer_login was created by postgres/audit/init.sql)

# Add to crontab on the Ubuntu host (or in a scheduled container):
crontab -e

# Run on the 20th of each month at 00:00 UTC — creates next month's partition
# (crontab entries must be a single line; cron does not honour backslash continuations)
0 0 20 * * docker exec bd-postgres-audit psql -U audit_maintainer_login -d auditdb -c "SELECT audit.create_next_month_partitions();" >> /var/log/bd-fhir-partition-maintenance.log 2>&1

Verify partition creation

# After the cron runs, verify the new partition exists
docker exec bd-postgres-audit psql -U postgres -d auditdb -c "
SELECT
    c.relname AS partition_name,
    pg_get_expr(c.relpartbound, c.oid) AS partition_range
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit'
  AND p.relname = 'audit_events'
ORDER BY c.relname DESC
LIMIT 3;
"
# Should show the three most recent monthly partitions

Monitor for missing partitions

Add this check to your monitoring system (Prometheus alerting or cron):

#!/bin/bash
# check_audit_partitions.sh
# Alert if the next month's partition does not exist by the 25th

# Anchor to mid-month: GNU date "+1 month" can skip a month when run near month-end
NEXT_MONTH=$(date -d "$(date +%Y-%m-15) +1 month" +%Y_%m)
PARTITION="audit_events_${NEXT_MONTH}"

RESULT=$(docker exec bd-postgres-audit psql -U postgres -d auditdb -tAc "
SELECT COUNT(*) FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'audit' AND c.relname = '${PARTITION}';")

if [ "$RESULT" -eq "0" ]; then
  echo "ALERT: Missing audit partition for next month: ${PARTITION}"
  # Send to your alerting system (PagerDuty, Slack, email)
  exit 1
fi
echo "OK: Partition ${PARTITION} exists"
exit 0

Monitoring — key metrics to track

These metrics indicate when scaling actions are needed.

PostgreSQL — fhir

-- Connection utilisation (should be <80% of max_connections; count all
-- backends, since idle connections still occupy connection slots)
SELECT count(*) as total_connections,
       max_conn,
       round(100.0 * count(*) / max_conn, 1) as utilisation_pct
FROM pg_stat_activity, (SELECT setting::int as max_conn FROM pg_settings WHERE name='max_connections') mc
GROUP BY max_conn;

-- Table bloat (trigger VACUUM if dead_tuple_ratio > 10%)
SELECT relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) as dead_pct
FROM pg_stat_user_tables
WHERE relname IN ('hfj_resource', 'hfj_spidx_token', 'hfj_res_ver')
ORDER BY dead_pct DESC;

-- Index usage (idx_scan = 0 on a non-new index suggests it is unused;
-- consider dropping it, since REINDEX will not make an unused index useful)
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname LIKE 'hfj_%'
ORDER BY idx_scan ASC
LIMIT 10;

PostgreSQL — audit

-- Partition sizes (plan next archive when any partition exceeds 10GB)
SELECT
    c.relname as partition,
    pg_size_pretty(pg_relation_size(c.oid)) as size
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit' AND p.relname = 'audit_events'
ORDER BY c.relname DESC;

-- Rejection rate by vendor (flag vendors with >10% rejection rate)
SELECT
    client_id,
    COUNT(*) as total_events,
    SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) as rejections,
    ROUND(100.0 * SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) / COUNT(*), 1) as rejection_pct
FROM audit.audit_events
WHERE event_time > NOW() - INTERVAL '7 days'
  AND event_type IN ('OPERATION', 'VALIDATION_FAILURE')
GROUP BY client_id
ORDER BY rejection_pct DESC;

HAPI — Prometheus metrics

Key metrics exposed at /actuator/prometheus:

| Metric | Alert threshold |
| --- | --- |
| hikaricp_connections_pending | >0 for >30s → pool exhaustion |
| hikaricp_connection_timeout_total | Any increment → pool exhaustion |
| http_server_requests_seconds_max | >30s → OCL timeout or slow validation |
| jvm_memory_used_bytes / jvm_memory_max_bytes | >85% → OOM risk, increase container memory |
| process_uptime_seconds | Resets → unexpected container restart |
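As an example, the first row of the table could be encoded as a Prometheus alerting rule roughly like this (a sketch; the group name and labels are assumptions):

```yaml
groups:
  - name: bd-fhir-hapi
    rules:
      - alert: HikariPoolExhaustion
        # hikaricp_connections_pending > 0 sustained for 30s (first row of the table)
        expr: hikaricp_connections_pending > 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "HAPI HikariCP pool has pending connection requests"
```

The remaining rows translate the same way; the `for:` clause is what distinguishes a transient spike from sustained exhaustion.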

IG upgrade procedure

When BD Core IG advances from v0.2.1 to v0.3.0:

# 1. On CI machine: place new package.tgz in src/main/resources/packages/
cp bd.gov.dghs.core-0.3.0.tgz hapi-overlay/src/main/resources/packages/

# 2. Remove old package (one IG version per image)
rm hapi-overlay/src/main/resources/packages/bd.gov.dghs.core-0.2.1.tgz

# 3. Update application.yaml / docker-compose env vars:
#    HAPI_IG_PACKAGE_CLASSPATH=classpath:packages/bd.gov.dghs.core-0.3.0.tgz
#    HAPI_IG_VERSION=0.3.0

# 4. Build and push new image
docker build \
  --build-arg IG_PACKAGE=bd.gov.dghs.core-0.3.0.tgz \
  --build-arg BUILD_VERSION=1.1.0 \
  --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
  -t your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0 \
  -f hapi-overlay/Dockerfile .

docker push your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0

# 5. Update HAPI_IMAGE in .env on production server
# 6. Redeploy (note: docker-compose recreates all hapi replicas at once, so
#    expect a brief interruption; true rolling upgrades arrive with Kubernetes in Phase 3)
docker-compose --env-file .env pull hapi
docker-compose --env-file .env up -d --no-deps hapi

# 7. Verify new IG version is active
curl -s https://fhir.dghs.gov.bd/fhir/metadata | jq '.software.version'
# Expected: "0.3.0" or the configured HAPI_IG_VERSION value

Vendor notification: IG upgrades that change SHALL constraints require vendor notification at least 30 days in advance. Vendors must test against the staging environment before production deployment.