bd-fhir-national/ops/scaling-roadmap.md
2026-03-16 00:02:58 +06:00


Scaling Roadmap — BD FHIR National

Audience: DGHS infrastructure team, future system architects
Current phase: Pilot (Phase 1)


Phase thresholds

| Metric | Phase 1 (Pilot) | Phase 2 (Regional) | Phase 3 (National) |
| --- | --- | --- | --- |
| Vendors | <50 | <500 | >500 |
| Resources/day | <10,000 | <100,000 | >1,000,000 |
| Resources total | <1M | <10M | >10M |
| HAPI replicas | 1 | 3 | 5-10+ |
| Orchestrator | docker-compose | docker-compose | Kubernetes |
| PostgreSQL | Single instance | Primary + replica | Patroni HA cluster |
| Estimated trigger | Now | 6-18 months | 18-36 months |

Phase 1 → Phase 2 changes

1. Scale HAPI replicas to 3

No configuration changes required — the architecture was designed for this from day one.

# On the production Ubuntu server
cd /opt/bd-fhir-national
docker-compose --env-file .env up -d --scale hapi=3

Verify after scaling:

# All 3 replicas healthy
docker-compose ps hapi

# nginx is load balancing across all 3
# (check HAPI logs — requests should appear in all replica logs)
docker-compose logs --tail=50 hapi

# pgBouncer pool has sufficient capacity
# 3 replicas × 5 HikariCP connections = 15 connections
# pgBouncer pool_size=20 — 5 headroom remaining. Acceptable.

pgBouncer adjustment at 3+ replicas: At 5 replicas (5 × 5 = 25 connections), the current pgBouncer pool_size=20 becomes a bottleneck. Update docker-compose.yml:

# pgbouncer-fhir environment:
PGBOUNCER_DEFAULT_POOL_SIZE: "30"   # was 20
# And increase postgres-fhir max_connections in postgresql.conf:
# max_connections = 40              # was 30
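The sizing arithmetic above can be sanity-checked before scaling. This is a minimal shell sketch using the replica and pool numbers from this document (the values are assumptions to adjust per deployment):

```shell
# Headroom check: pgBouncer pool_size must cover replicas x HikariCP pool size.
REPLICAS=5            # target HAPI replica count
HIKARI_POOL=5         # HikariCP connections per replica (per this document)
PGBOUNCER_POOL=30     # PGBOUNCER_DEFAULT_POOL_SIZE after the adjustment above

NEEDED=$((REPLICAS * HIKARI_POOL))
HEADROOM=$((PGBOUNCER_POOL - NEEDED))
echo "needed=${NEEDED} pool=${PGBOUNCER_POOL} headroom=${HEADROOM}"
if [ "$HEADROOM" -lt 5 ]; then
  echo "WARN: less than 5 connections of headroom; raise PGBOUNCER_DEFAULT_POOL_SIZE"
fi
```

Run it with the target replica count before each scale-up, and keep max_connections on postgres-fhir above the pgBouncer pool size.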

2. Add PostgreSQL streaming replication (read replica)

For read-heavy workloads (FHIR search, bulk export), add a read replica. HAPI supports separate read and write datasource URLs.

# Add to docker-compose.yml:
postgres-fhir-replica:
  image: postgres:15-alpine
  environment:
    POSTGRES_DB: fhirdb
    PGUSER: replicator
    POSTGRES_PASSWORD: ${FHIR_REPLICA_PASSWORD}
    # pg_basebackup authenticates via PGPASSWORD; POSTGRES_PASSWORD alone is not read by it
    PGPASSWORD: ${FHIR_REPLICA_PASSWORD}
  volumes:
    - postgres-fhir-replica-data:/var/lib/postgresql/data
    - ./postgres/fhir/replica.conf:/etc/postgresql/postgresql.conf:ro
  # Bootstrap from the primary only on first start: re-running pg_basebackup against
  # a non-empty data directory fails, which would loop forever on container restart.
  command: >
    bash -c "
    if [ ! -s /var/lib/postgresql/data/PG_VERSION ]; then
      until pg_basebackup -h postgres-fhir -U replicator -D /var/lib/postgresql/data -P -Xs -R; do
        sleep 5;
      done;
    fi;
    postgres -c config_file=/etc/postgresql/postgresql.conf"
  networks:
    - backend-fhir

Add HAPI_DATASOURCE_READ_URL environment variable pointing to the replica, and update DataSourceConfig.java to configure a separate read datasource.
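Before routing reads to the replica, confirm replication is streaming on the primary. The helper below is a sketch: the docker exec command in the comment assumes the primary container exposes psql, and the sample input stands in for real pg_stat_replication output:

```shell
# Fail if any replication connection on the primary is not in "streaming" state.
# Feed it the output of, for example:
#   docker exec <postgres-fhir container> psql -tA -d fhirdb \
#     -c "SELECT state FROM pg_stat_replication;"
check_replication() {
  awk 'NF && $1 != "streaming" { print "ALERT: replica state is " $1; bad=1 }
       END { if (!bad) print "OK: all replicas streaming"; exit bad }'
}

# Example with sample input:
printf 'streaming\n' | check_replication
```

The non-zero exit code on failure makes this usable as a gate in a deploy script.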

3. Add Redis for distributed JWKS cache

Currently each HAPI replica maintains an independent in-memory JWKS cache. At 3 replicas, a Keycloak key rotation triggers 3 independent JWKS re-fetches within the same second. This is acceptable. At 10+ replicas, add Redis for a shared JWKS cache to reduce Keycloak load.

# Add to docker-compose.yml:
redis:
  image: redis:7-alpine
  networks:
    - frontend
    - backend-fhir
  command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}

Update KeycloakJwtInterceptor to use Spring Cache with Redis backend for JWKS storage.


Phase 2 → Phase 3 changes

Move to Kubernetes

At national scale, docker-compose is not the correct orchestrator. Kubernetes provides:

  • Horizontal Pod Autoscaler (scale on CPU/RPS automatically)
  • Rolling deployments (zero-downtime IG version upgrades)
  • Pod Disruption Budgets (maintain minimum replicas during node maintenance)
  • Namespace isolation (separate FHIR, audit, monitoring namespaces)

Kubernetes equivalents:

| docker-compose service | Kubernetes resource |
| --- | --- |
| hapi (--scale N) | Deployment + HPA |
| postgres-fhir | StatefulSet (or external Patroni) |
| postgres-audit | StatefulSet (or external Patroni) |
| pgbouncer-fhir | Deployment (sidecar or standalone) |
| nginx | Ingress (nginx-ingress-controller) |
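As an illustration of the first row, the hapi service maps to a Deployment plus an HPA along these lines (a sketch only; the names and thresholds are assumptions, not an agreed manifest):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hapi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hapi
  minReplicas: 5        # Phase 3 floor from the thresholds table
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Remember that each added replica consumes another HikariCP pool's worth of pgBouncer connections, so maxReplicas must stay consistent with the pool sizing rules from Phase 2.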

Partition HAPI JPA tables

At 5M+ resources in HFJ_RESOURCE, evaluate partitioning (see V1 migration comments).

Prerequisites before partitioning HAPI JPA tables:

  1. HAPI must be stopped during the migration (ALTER TABLE is not online in PostgreSQL 15)
  2. Foreign key references to HFJ_RESOURCE from all SPIDX tables must be updated
  3. The partition key must be included in all primary keys
  4. Hibernate DDL validation must be disabled during migration, then re-enabled

This is a planned maintenance window operation — minimum 4-hour downtime window for a database with 5M resources. At 10,000 resources/day, reaching 5M resources takes about 500 days, so you have roughly 16 months from initial deployment to plan this migration.

Trigger: Run EXPLAIN ANALYZE on a representative FHIR search query. When sequential scans on HFJ_RESOURCE appear in the plan despite indexes, partitioning is overdue.
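This trigger can be scripted. The filter below is a sketch: in production, pipe real EXPLAIN ANALYZE output into check_plan (for example via docker exec into the fhir PostgreSQL container); the sample plan fragment here is illustrative only:

```shell
# Flag sequential scans on HFJ_RESOURCE in an EXPLAIN ANALYZE plan.
check_plan() {
  if grep -qi "seq scan on hfj_resource"; then
    echo "WARN: sequential scan on hfj_resource despite indexes; partitioning is overdue"
  else
    echo "OK: no sequential scan on hfj_resource"
  fi
}

# Example with a sample plan fragment:
printf 'Limit (cost=0.43..8.45)\n  ->  Seq Scan on hfj_resource\n' | check_plan
```

Running this monthly against a representative search query turns the "overdue" judgment into a concrete, reviewable signal.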


Partition maintenance — monthly cron job

The audit tables are partitioned by month with partitions pre-created through 2027. A missing partition causes INSERT to fail with a hard error — no graceful degradation.

Setup (run once on the audit PostgreSQL host)

# Create a login user for the maintenance function
# (audit_maintainer_login was created by postgres/audit/init.sql)

# Add to crontab on the Ubuntu host (or in a scheduled container):
crontab -e

# Run on the 20th of each month at 00:00 UTC — creates next month's partition
# (crontab entries must be a single line; cron does not honour backslash continuations)
0 0 20 * * docker exec bd-postgres-audit psql -U audit_maintainer_login -d auditdb -c "SELECT audit.create_next_month_partitions();" >> /var/log/bd-fhir-partition-maintenance.log 2>&1

Verify partition creation

# After the cron runs, verify the new partition exists
docker exec bd-postgres-audit psql -U postgres -d auditdb -c "
SELECT
    c.relname AS partition_name,
    pg_get_expr(c.relpartbound, c.oid) AS partition_range
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit'
  AND p.relname = 'audit_events'
ORDER BY c.relname DESC
LIMIT 3;
"
# Should show the three most recent monthly partitions

Monitor for missing partitions

Add this check to your monitoring system (Prometheus alerting or cron):

#!/bin/bash
# check_audit_partitions.sh
# Alert if the next month's partition does not exist by the 25th

# Anchor to mid-month: GNU date "+1 month" can skip a month when run near month-end
NEXT_MONTH=$(date -d "$(date +%Y-%m-15) +1 month" +%Y_%m)
PARTITION="audit_events_${NEXT_MONTH}"

RESULT=$(docker exec bd-postgres-audit psql -U postgres -d auditdb -tAc "
SELECT COUNT(*) FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'audit' AND c.relname = '${PARTITION}';")

if [ "$RESULT" -eq "0" ]; then
  echo "ALERT: Missing audit partition for next month: ${PARTITION}"
  # Send to your alerting system (PagerDuty, Slack, email)
  exit 1
fi
echo "OK: Partition ${PARTITION} exists"
exit 0

Monitoring — key metrics to track

These metrics indicate when scaling actions are needed.

PostgreSQL — fhir

-- Connection utilisation (should be <80% of max_connections; count all
-- backends, since idle connections still occupy connection slots)
SELECT count(*) as total_connections,
       max_conn,
       round(100.0 * count(*) / max_conn, 1) as utilisation_pct
FROM pg_stat_activity, (SELECT setting::int as max_conn FROM pg_settings WHERE name='max_connections') mc
GROUP BY max_conn;

-- Table bloat (trigger VACUUM if dead_tuple_ratio > 10%)
SELECT relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) as dead_pct
FROM pg_stat_user_tables
WHERE relname IN ('hfj_resource', 'hfj_spidx_token', 'hfj_res_ver')
ORDER BY dead_pct DESC;

-- Index usage (idx_scan = 0 on a non-new index suggests it is unused;
-- consider dropping it, since REINDEX will not make an unused index useful)
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname LIKE 'hfj_%'
ORDER BY idx_scan ASC
LIMIT 10;

PostgreSQL — audit

-- Partition sizes (plan next archive when any partition exceeds 10GB)
SELECT
    c.relname as partition,
    pg_size_pretty(pg_relation_size(c.oid)) as size
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit' AND p.relname = 'audit_events'
ORDER BY c.relname DESC;

-- Rejection rate by vendor (flag vendors with >10% rejection rate)
SELECT
    client_id,
    COUNT(*) as total_events,
    SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) as rejections,
    ROUND(100.0 * SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) / COUNT(*), 1) as rejection_pct
FROM audit.audit_events
WHERE event_time > NOW() - INTERVAL '7 days'
  AND event_type IN ('OPERATION', 'VALIDATION_FAILURE')
GROUP BY client_id
ORDER BY rejection_pct DESC;

HAPI — Prometheus metrics

Key metrics exposed at /actuator/prometheus:

| Metric | Alert threshold |
| --- | --- |
| hikaricp_connections_pending | >0 for >30s → pool exhaustion |
| hikaricp_connection_timeout_total | Any increment → pool exhaustion |
| http_server_requests_seconds_max | >30s → OCL timeout or slow validation |
| jvm_memory_used_bytes / jvm_memory_max_bytes | >85% → OOM risk, increase container memory |
| process_uptime_seconds | Resets → unexpected container restart |
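As an example, the first row of the table could be encoded as a Prometheus alerting rule roughly like this (a sketch; the group name and labels are assumptions):

```yaml
groups:
  - name: bd-fhir-hapi
    rules:
      - alert: HikariPoolExhaustion
        # hikaricp_connections_pending > 0 sustained for 30s (first row of the table)
        expr: hikaricp_connections_pending > 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "HAPI HikariCP pool has pending connection requests"
```

The remaining rows translate the same way; the `for:` clause is what distinguishes a transient spike from sustained exhaustion.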

IG upgrade procedure

When BD Core IG advances from v0.2.1 to v0.3.0:

# 1. On CI machine: place new package.tgz in src/main/resources/packages/
cp bd.gov.dghs.core-0.3.0.tgz hapi-overlay/src/main/resources/packages/

# 2. Remove old package (one IG version per image)
rm hapi-overlay/src/main/resources/packages/bd.gov.dghs.core-0.2.1.tgz

# 3. Update application.yaml / docker-compose env vars:
#    HAPI_IG_PACKAGE_CLASSPATH=classpath:packages/bd.gov.dghs.core-0.3.0.tgz
#    HAPI_IG_VERSION=0.3.0

# 4. Build and push new image
docker build \
  --build-arg IG_PACKAGE=bd.gov.dghs.core-0.3.0.tgz \
  --build-arg BUILD_VERSION=1.1.0 \
  --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
  -t your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0 \
  -f hapi-overlay/Dockerfile .

docker push your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0

# 5. Update HAPI_IMAGE in .env on production server
# 6. Redeploy (note: docker-compose recreates all hapi replicas at once, so
#    expect a brief interruption; true rolling upgrades arrive with Kubernetes in Phase 3)
docker-compose --env-file .env pull hapi
docker-compose --env-file .env up -d --no-deps hapi

# 7. Verify new IG version is active
curl -s https://fhir.dghs.gov.bd/fhir/metadata | jq '.software.version'
# Expected: "0.3.0" or the configured HAPI_IG_VERSION value

Vendor notification: IG upgrades that change SHALL constraints require vendor notification at least 30 days in advance. Vendors must test against the staging environment before production deployment.