<!-- bd-fhir-national/ops/scaling-roadmap.md, 2026-03-16 -->
# Scaling Roadmap — BD FHIR National
**Audience:** DGHS infrastructure team, future system architects
**Current phase:** Pilot (Phase 1)
---
## Phase thresholds
| Metric | Phase 1 (Pilot) | Phase 2 (Regional) | Phase 3 (National) |
|--------|-----------------|---------------------|---------------------|
| Vendors | <50 | <500 | >500 |
| Resources/day | <10,000 | <100,000 | >1,000,000 |
| Resources total | <1M | <10M | >10M |
| HAPI replicas | 1 | 3 | 5-10+ |
| Orchestrator | docker-compose | docker-compose | Kubernetes |
| PostgreSQL | Single instance | Primary + replica | Patroni HA cluster |
| Estimated trigger | Now | 6-18 months | 18-36 months |
---
## Phase 1 → Phase 2 changes
### 1. Scale HAPI replicas to 3
No configuration changes required — the architecture was designed for this from day one.
```bash
# On the production Ubuntu server
cd /opt/bd-fhir-national
docker-compose --env-file .env up -d --scale hapi=3
```
**Verify after scaling:**
```bash
# All 3 replicas healthy
docker-compose ps hapi
# nginx is load balancing across all 3
# (check HAPI logs — requests should appear in all replica logs)
docker-compose logs --tail=50 hapi
# pgBouncer pool has sufficient capacity
# 3 replicas × 5 HikariCP connections = 15 connections
# pgBouncer pool_size=20 — 5 headroom remaining. Acceptable.
```
**pgBouncer adjustment at 5+ replicas:**
At 5 replicas (5 × 5 = 25 connections), the current pgBouncer pool_size=20 becomes
a bottleneck. Update docker-compose.yml:
```yaml
# pgbouncer-fhir environment:
PGBOUNCER_DEFAULT_POOL_SIZE: "30" # was 20
# And increase postgres-fhir max_connections in postgresql.conf:
# max_connections = 40 # was 30
```
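The pool arithmetic above can be scripted as a quick pre-scaling check (the figure of 5 HikariCP connections per replica comes from this roadmap; the margin of 5 is an assumed buffer for admin sessions):

```shell
#!/bin/sh
# Sketch: minimum pgBouncer pool size for N HAPI replicas.
REPLICAS=5
HIKARI_PER_REPLICA=5   # HikariCP connections per HAPI replica (per this roadmap)
MARGIN=5               # assumed headroom for admin/maintenance sessions
NEEDED=$((REPLICAS * HIKARI_PER_REPLICA + MARGIN))
echo "At ${REPLICAS} replicas, set pool_size >= ${NEEDED} and max_connections above that"
```

Re-run the arithmetic before every `--scale` change; pool exhaustion shows up as `hikaricp_connections_pending` in the monitoring section below.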
### 2. Add PostgreSQL streaming replication (read replica)
For read-heavy workloads (FHIR search, bulk export), add a read replica.
HAPI supports separate read and write datasource URLs.
```yaml
# Add to docker-compose.yml:
postgres-fhir-replica:
  image: postgres:15-alpine
  environment:
    POSTGRES_DB: fhirdb
    PGUSER: replicator
    POSTGRES_PASSWORD: ${FHIR_REPLICA_PASSWORD}
  volumes:
    - postgres-fhir-replica-data:/var/lib/postgresql/data
    - ./postgres/fhir/replica.conf:/etc/postgresql/postgresql.conf:ro
  command: >
    bash -c "
      until pg_basebackup -h postgres-fhir -U replicator -D /var/lib/postgresql/data -P -Xs -R; do
        sleep 5;
      done && postgres -c config_file=/etc/postgresql/postgresql.conf"
  networks:
    - backend-fhir
```
Add `HAPI_DATASOURCE_READ_URL` environment variable pointing to the replica,
and update `DataSourceConfig.java` to configure a separate read datasource.
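Before routing reads to the replica, confirm streaming replication is actually attached. A sketch of the checks (the primary container name `bd-postgres-fhir` is an assumption by analogy with `bd-postgres-audit` used elsewhere in this runbook):

```shell
# On the primary: one row per attached standby, with replay lag
docker exec bd-postgres-fhir psql -U postgres -d fhirdb \
  -c "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"

# On the replica: pg_is_in_recovery() returns 't' while serving as a standby
docker-compose exec postgres-fhir-replica psql -U postgres -d fhirdb \
  -c "SELECT pg_is_in_recovery();"
```

If `pg_stat_replication` is empty, the replica fell back to running standalone and reads against it would silently go stale.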
### 3. Add Redis for distributed JWKS cache
Currently each HAPI replica maintains an independent in-memory JWKS cache.
At 3 replicas, a Keycloak key rotation triggers 3 independent JWKS re-fetches
within the same second. This is acceptable. At 10+ replicas, add Redis for
a shared JWKS cache to reduce Keycloak load.
```yaml
# Add to docker-compose.yml:
redis:
  image: redis:7-alpine
  networks:
    - frontend
    - backend-fhir
  command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
```
Update `KeycloakJwtInterceptor` to use Spring Cache with Redis backend for JWKS storage.
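Once the service is up, a quick sanity check that the password and AOF persistence took effect (service name as in the fragment above):

```shell
# Authenticated ping: a correct password returns PONG
docker-compose exec redis redis-cli -a "$REDIS_PASSWORD" ping

# Confirm append-only persistence is enabled (expect: appendonly / yes)
docker-compose exec redis redis-cli -a "$REDIS_PASSWORD" config get appendonly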
---
## Phase 2 → Phase 3 changes
### Move to Kubernetes
At national scale, docker-compose is not the correct orchestrator. Kubernetes
provides:
- Horizontal Pod Autoscaler (scale on CPU/RPS automatically)
- Rolling deployments (zero-downtime IG version upgrades)
- Pod Disruption Budgets (maintain minimum replicas during node maintenance)
- Namespace isolation (separate FHIR, audit, monitoring namespaces)
**Kubernetes equivalents:**
| docker-compose service | Kubernetes resource |
|------------------------|---------------------|
| hapi (--scale N) | Deployment + HPA |
| postgres-fhir | StatefulSet (or external Patroni) |
| postgres-audit | StatefulSet (or external Patroni) |
| pgbouncer-fhir | Deployment (sidecar or standalone) |
| nginx | Ingress (nginx-ingress-controller) |
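The first two rows of that mapping can be sketched with imperative `kubectl` commands (deployment, namespace, and label names here are hypothetical; a real migration would use declarative manifests):

```shell
# Horizontal Pod Autoscaler: scale the hypothetical "hapi" Deployment on CPU
kubectl -n fhir autoscale deployment hapi --min=5 --max=10 --cpu-percent=70

# Pod Disruption Budget: keep at least 3 HAPI pods up during node maintenance
kubectl -n fhir create poddisruptionbudget hapi-pdb \
  --selector=app=hapi --min-available=3
```

The same `pool_size` arithmetic from Phase 2 applies: HPA `--max` times the per-pod HikariCP pool must stay under the pgBouncer pool.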
### Partition HAPI JPA tables
At 5M+ resources in `HFJ_RESOURCE`, evaluate partitioning (see V1 migration comments).
Prerequisites before partitioning HAPI JPA tables:
1. HAPI must be stopped during the migration (ALTER TABLE is not online in PostgreSQL 15)
2. Foreign key references to HFJ_RESOURCE from all SPIDX tables must be updated
3. The partition key must be included in all primary keys
4. Hibernate DDL validation must be disabled during migration, then re-enabled
This is a planned maintenance window operation — minimum 4-hour downtime window
for a database with 5M resources. At 10,000 resources/day, you have
approximately 18 months to plan this migration from initial deployment.
**Trigger:** Run `EXPLAIN ANALYZE` on a representative FHIR search query.
When sequential scans on HFJ_RESOURCE appear in the plan despite indexes,
partitioning is overdue.
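A sketch of that check (the container and database names are assumptions by analogy with the audit host; `RES_TYPE` and `RES_DELETED_AT` are standard HFJ_RESOURCE columns, but substitute a search your vendors actually run):

```shell
# Grep the plan of a representative query for sequential scans on HFJ_RESOURCE
docker exec bd-postgres-fhir psql -U postgres -d fhirdb -c "
  EXPLAIN ANALYZE
  SELECT res_id FROM hfj_resource
  WHERE res_type = 'Patient' AND res_deleted_at IS NULL
  LIMIT 100;
" | grep -iq 'seq scan' \
  && echo "WARN: sequential scan on hfj_resource, evaluate partitioning"
```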
---
## Partition maintenance — monthly cron job
The audit tables are partitioned by month with partitions pre-created through 2027.
**A missing partition causes INSERT to fail with a hard error** — no graceful degradation.
### Setup (run once on the audit PostgreSQL host)
```bash
# Create a login user for the maintenance function
# (audit_maintainer_login was created by postgres/audit/init.sql)
# Add to crontab on the Ubuntu host (or in a scheduled container):
crontab -e
# Run on the 20th of each month at 00:00 UTC to create next month's partition.
# A crontab entry must be a single line: cron does not support backslash continuation.
0 0 20 * * docker exec bd-postgres-audit psql -U audit_maintainer_login -d auditdb -c "SELECT audit.create_next_month_partitions();" >> /var/log/bd-fhir-partition-maintenance.log 2>&1
```
### Verify partition creation
```bash
# After the cron runs, verify the new partition exists
docker exec bd-postgres-audit psql -U postgres -d auditdb -c "
SELECT
c.relname AS partition_name,
pg_get_expr(c.relpartbound, c.oid) AS partition_range
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit'
AND p.relname = 'audit_events'
ORDER BY c.relname DESC
LIMIT 3;
"
# Should show the three most recent monthly partitions
```
### Monitor for missing partitions
Add this check to your monitoring system (Prometheus alerting or cron):
```bash
#!/bin/bash
# check_audit_partitions.sh
# Alert if next month's partition does not exist by the 25th.
# Anchor to the 1st of the month: a bare "date -d '+1 month'" can skip a month
# when run near month-end (GNU date resolves Jan 31 + 1 month to Mar 3).
NEXT_MONTH=$(date -d "$(date +%Y-%m-01) +1 month" +%Y_%m)
PARTITION="audit_events_${NEXT_MONTH}"
RESULT=$(docker exec bd-postgres-audit psql -U postgres -d auditdb -tAc "
SELECT COUNT(*) FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'audit' AND c.relname = '${PARTITION}';")
if [ "$RESULT" -eq "0" ]; then
echo "ALERT: Missing audit partition for next month: ${PARTITION}"
# Send to your alerting system (PagerDuty, Slack, email)
exit 1
fi
echo "OK: Partition ${PARTITION} exists"
exit 0
```
---
## Monitoring — key metrics to track
These metrics indicate when scaling actions are needed.
### PostgreSQL — fhir
```sql
-- Connection utilisation: every backend, idle or active, holds a slot
-- against max_connections. Keep this below ~80%.
SELECT count(*) AS connections,
       max_conn,
       round(100.0 * count(*) / max_conn, 1) AS utilisation_pct
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn FROM pg_settings WHERE name = 'max_connections') mc
GROUP BY max_conn;
-- Table bloat (trigger VACUUM if dead_tuple_ratio > 10%)
SELECT relname, n_live_tup, n_dead_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) as dead_pct
FROM pg_stat_user_tables
WHERE relname IN ('hfj_resource', 'hfj_spidx_token', 'hfj_res_ver')
ORDER BY dead_pct DESC;
-- Index usage (idx_scan = 0 on a non-new index means it is never used;
-- review whether it can be dropped rather than reindexed)
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname LIKE 'hfj_%'
ORDER BY idx_scan ASC
LIMIT 10;
```
### PostgreSQL — audit
```sql
-- Partition sizes (plan next archive when any partition exceeds 10GB)
SELECT
c.relname as partition,
pg_size_pretty(pg_relation_size(c.oid)) as size
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit' AND p.relname = 'audit_events'
ORDER BY c.relname DESC;
-- Rejection rate by vendor (flag vendors with >10% rejection rate)
SELECT
client_id,
COUNT(*) as total_events,
SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) as rejections,
ROUND(100.0 * SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) / COUNT(*), 1) as rejection_pct
FROM audit.audit_events
WHERE event_time > NOW() - INTERVAL '7 days'
AND event_type IN ('OPERATION', 'VALIDATION_FAILURE')
GROUP BY client_id
ORDER BY rejection_pct DESC;
```
### HAPI — Prometheus metrics
Key metrics exposed at `/actuator/prometheus`:
| Metric | Alert threshold |
|--------|-----------------|
| `hikaricp_connections_pending` | >0 for >30s → pool exhaustion |
| `hikaricp_connection_timeout_total` | Any increment → pool exhaustion |
| `http_server_requests_seconds_max` | >30s → OCL timeout or slow validation |
| `jvm_memory_used_bytes / jvm_memory_max_bytes` | >85% → OOM risk, increase container memory |
| `process_uptime_seconds` | Resets → unexpected container restart |
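A quick way to spot-check the pool metrics from the host (assumes one replica's actuator port is published on localhost:8080; adjust host and port for your deployment):

```shell
# Pull current HikariCP pool state from one replica's Prometheus endpoint
curl -s http://localhost:8080/actuator/prometheus \
  | grep -E '^(hikaricp_connections_pending|hikaricp_connection_timeout_total)'
```

Any nonzero `pending` value that persists across scrapes means requests are queueing for database connections; revisit the pgBouncer sizing in Phase 2.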
---
## IG upgrade procedure
When BD Core IG advances from v0.2.1 to v0.3.0:
```bash
# 1. On CI machine: place new package.tgz in src/main/resources/packages/
cp bd.gov.dghs.core-0.3.0.tgz hapi-overlay/src/main/resources/packages/
# 2. Remove old package (one IG version per image)
rm hapi-overlay/src/main/resources/packages/bd.gov.dghs.core-0.2.1.tgz
# 3. Update application.yaml / docker-compose env vars:
# HAPI_IG_PACKAGE_CLASSPATH=classpath:packages/bd.gov.dghs.core-0.3.0.tgz
# HAPI_IG_VERSION=0.3.0
# 4. Build and push new image
docker build \
--build-arg IG_PACKAGE=bd.gov.dghs.core-0.3.0.tgz \
--build-arg BUILD_VERSION=1.1.0 \
--build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
-t your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0 \
-f hapi-overlay/Dockerfile .
docker push your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0
# 5. Update HAPI_IMAGE in .env on production server
# 6. Rolling redeploy (if running scaled replicas, repeat the --scale hapi=N
#    flag here, otherwise docker-compose may drop back to a single replica)
docker-compose --env-file .env pull hapi
docker-compose --env-file .env up -d --no-deps hapi
# 7. Verify new IG version is active
curl -s https://fhir.dghs.gov.bd/fhir/metadata | jq '.software.version'
# Expected: "0.3.0" or the configured HAPI_IG_VERSION value
```
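If the overlay registers the package's profiles at startup, a spot-check that the new profiles are searchable can catch a bad package early. The canonical URL below is hypothetical; substitute one actually defined in the IG:

```shell
# Count StructureDefinitions matching a profile canonical from the new IG
curl -s "https://fhir.dghs.gov.bd/fhir/StructureDefinition?url=https://fhir.dghs.gov.bd/fhir/StructureDefinition/bd-core-patient&_summary=count" \
  | jq '.total'
# A total of 1 indicates the profile from the new package is registered
```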
**Vendor notification:** IG upgrades that change SHALL constraints require
vendor notification at least 30 days in advance. Vendors must test against
the staging environment before production deployment.