# Scaling Roadmap — BD FHIR National

**Audience:** DGHS infrastructure team, future system architects

**Current phase:** Pilot (Phase 1)

---

## Phase thresholds
| Metric | Phase 1 (Pilot) | Phase 2 (Regional) | Phase 3 (National) |
|--------|-----------------|--------------------|--------------------|
| Vendors | <50 | <500 | >500 |
| Resources/day | <10,000 | <100,000 | >1,000,000 |
| Resources total | <1M | <10M | >10M |
| HAPI replicas | 1 | 3 | 5-10+ |
| Orchestrator | docker-compose | docker-compose | Kubernetes |
| PostgreSQL | Single instance | Primary + replica | Patroni HA cluster |
| Estimated trigger | Now | 6-18 months | 18-36 months |

---
## Phase 1 → Phase 2 changes

### 1. Scale HAPI replicas to 3

No configuration changes are required — the architecture was designed for this from day one.

```bash
# On the production Ubuntu server
cd /opt/bd-fhir-national
docker-compose --env-file .env up -d --scale hapi=3
```
**Verify after scaling:**

```bash
# All 3 replicas healthy
docker-compose ps hapi

# nginx is load balancing across all 3
# (check HAPI logs — requests should appear in all replica logs)
docker-compose logs --tail=50 hapi

# pgBouncer pool has sufficient capacity:
# 3 replicas × 5 HikariCP connections = 15 connections
# pgBouncer pool_size=20 — 5 connections of headroom remaining. Acceptable.
```
**pgBouncer adjustment at 3+ replicas:**

At 5 replicas (5 × 5 = 25 connections), the current pgBouncer pool_size=20 becomes a bottleneck. Update docker-compose.yml:

```yaml
# pgbouncer-fhir environment:
PGBOUNCER_DEFAULT_POOL_SIZE: "30"   # was 20

# And increase postgres-fhir max_connections in postgresql.conf:
# max_connections = 40   # was 30
```
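
The sizing arithmetic generalises to any replica count. A small helper, purely illustrative (the 5-connections-per-replica figure is the pilot default; the 20% headroom margin is an assumption, not a repo setting), sketches the calculation:

```shell
# pool_size: recommended pgBouncer pool size for N HAPI replicas.
# Assumes 5 HikariCP connections per replica plus ~20% headroom
# (rounded up). Both figures are illustrative defaults.
pool_size() {
  local replicas=$1
  local per_replica=5
  local needed=$((replicas * per_replica))
  # headroom = ceil(needed / 5), i.e. 20% rounded up
  echo $((needed + (needed + 4) / 5))
}

pool_size 5   # prints 30 (25 needed + 5 headroom)
```

For 3 replicas this yields 18, which fits inside the current pool_size=20; for 5 replicas it yields 30, matching the adjustment above.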
### 2. Add PostgreSQL streaming replication (read replica)

For read-heavy workloads (FHIR search, bulk export), add a read replica. HAPI supports separate read and write datasource URLs.

```yaml
# Add to docker-compose.yml:
postgres-fhir-replica:
  image: postgres:15-alpine
  environment:
    POSTGRES_DB: fhirdb
    PGUSER: replicator
    POSTGRES_PASSWORD: ${FHIR_REPLICA_PASSWORD}
  volumes:
    - postgres-fhir-replica-data:/var/lib/postgresql/data
    - ./postgres/fhir/replica.conf:/etc/postgresql/postgresql.conf:ro
  command: >
    bash -c "
    until pg_basebackup -h postgres-fhir -U replicator -D /var/lib/postgresql/data -P -Xs -R; do
      sleep 5;
    done && postgres -c config_file=/etc/postgresql/postgresql.conf"
  networks:
    - backend-fhir
```

Add a `HAPI_DATASOURCE_READ_URL` environment variable pointing to the replica, and update `DataSourceConfig.java` to configure a separate read datasource.
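
Once the replica has cloned and attached, confirm that streaming replication is actually running. A sketch of a check from the primary, assuming the `postgres-fhir` service name and `fhirdb` database from the compose file, plus a superuser `postgres` role (an assumption):

```shell
# On the primary: expect one row per replica with state = 'streaming'
# and a replay lag that stays near zero under normal load.
docker-compose exec postgres-fhir psql -U postgres -d fhirdb -tAc "
  SELECT application_name, state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
  FROM pg_stat_replication;"
```

If the query returns no rows, the replica never connected; check the `pg_basebackup` loop in the replica container logs first.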
### 3. Add Redis for distributed JWKS cache

Currently each HAPI replica maintains an independent in-memory JWKS cache. At 3 replicas, a Keycloak key rotation triggers 3 independent JWKS re-fetches within the same second, which is acceptable. At 10+ replicas, add Redis as a shared JWKS cache to reduce the load on Keycloak.

```yaml
# Add to docker-compose.yml:
redis:
  image: redis:7-alpine
  networks:
    - frontend
    - backend-fhir
  command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
```

Update `KeycloakJwtInterceptor` to use Spring Cache with a Redis backend for JWKS storage.
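
After bringing the service up, a quick smoke test is worthwhile. A sketch, assuming the `redis` service name and `$REDIS_PASSWORD` variable from the compose snippet above:

```shell
# Verify Redis is reachable, authenticated, and persisting (AOF on).
docker-compose exec redis redis-cli -a "$REDIS_PASSWORD" ping
docker-compose exec redis redis-cli -a "$REDIS_PASSWORD" config get appendonly
# Expect "PONG" from the first command and "appendonly yes" from the second.
```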
---

## Phase 2 → Phase 3 changes

### Move to Kubernetes

At national scale, docker-compose is not the correct orchestrator. Kubernetes provides:

- Horizontal Pod Autoscaler (scale on CPU/RPS automatically)
- Rolling deployments (zero-downtime IG version upgrades)
- Pod Disruption Budgets (maintain minimum replicas during node maintenance)
- Namespace isolation (separate FHIR, audit, monitoring namespaces)

**Kubernetes equivalents:**

| docker-compose service | Kubernetes resource |
|------------------------|---------------------|
| hapi (--scale N) | Deployment + HPA |
| postgres-fhir | StatefulSet (or external Patroni) |
| postgres-audit | StatefulSet (or external Patroni) |
| pgbouncer-fhir | Deployment (sidecar or standalone) |
| nginx | Ingress (nginx-ingress-controller) |
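
As one concrete mapping: the Phase 2 `--scale hapi=3` command becomes a Deployment plus an HPA. A sketch using plain `kubectl`; the `fhir` namespace, deployment name, and 70% CPU target are illustrative, not taken from an existing manifest:

```shell
# Pin a baseline of 3 replicas, then let the HPA scale on CPU.
kubectl -n fhir scale deployment hapi --replicas=3
kubectl -n fhir autoscale deployment hapi \
  --min=3 --max=10 --cpu-percent=70

# Inspect current vs. target utilisation:
kubectl -n fhir get hpa hapi
```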
### Partition HAPI JPA tables

At 5M+ resources in `HFJ_RESOURCE`, evaluate partitioning (see the V1 migration comments).

Prerequisites before partitioning HAPI JPA tables:

1. HAPI must be stopped during the migration (ALTER TABLE is not online in PostgreSQL 15)
2. Foreign key references to HFJ_RESOURCE from all SPIDX tables must be updated
3. The partition key must be included in all primary keys
4. Hibernate DDL validation must be disabled during the migration, then re-enabled

This is a planned maintenance-window operation — a minimum 4-hour downtime window for a database with 5M resources. At 10,000 resources/day, you have approximately 18 months from initial deployment to plan this migration.

**Trigger:** Run `EXPLAIN ANALYZE` on a representative FHIR search query. When sequential scans on HFJ_RESOURCE appear in the plan despite indexes, partitioning is overdue.
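
The trigger check can be scripted. A sketch: a small helper that scans `EXPLAIN ANALYZE` output for sequential scans on `hfj_resource`. The plan-fetching command in the comment is illustrative (container name, role, and query are assumptions); only the parsing is self-contained:

```shell
# plan_has_seq_scan: exit 0 if EXPLAIN output on stdin contains a
# sequential scan on hfj_resource, non-zero otherwise.
plan_has_seq_scan() {
  grep -qi 'Seq Scan on hfj_resource'
}

# Typical use (connection details are illustrative):
# docker-compose exec postgres-fhir psql -U postgres -d fhirdb -c \
#   "EXPLAIN ANALYZE SELECT * FROM hfj_resource WHERE res_type = 'Patient';" \
#   | plan_has_seq_scan && echo "WARNING: seq scan on hfj_resource; partitioning overdue"
```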
---

## Partition maintenance — monthly cron job

The audit tables are partitioned by month, with partitions pre-created through 2027. **A missing partition causes INSERT to fail with a hard error** — there is no graceful degradation.

### Setup (run once on the audit PostgreSQL host)

```bash
# A login user for the maintenance function (audit_maintainer_login)
# was created by postgres/audit/init.sql

# Add to crontab on the Ubuntu host (or in a scheduled container):
crontab -e

# Run on the 20th of each month at 00:00 UTC — creates next month's partition
0 0 20 * * docker exec bd-postgres-audit psql \
    -U audit_maintainer_login \
    -d auditdb \
    -c "SELECT audit.create_next_month_partitions();" \
    >> /var/log/bd-fhir-partition-maintenance.log 2>&1
```
### Verify partition creation

```bash
# After the cron runs, verify the new partition exists
docker exec bd-postgres-audit psql -U postgres -d auditdb -c "
SELECT
    c.relname AS partition_name,
    pg_get_expr(c.relpartbound, c.oid) AS partition_range
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit'
  AND p.relname = 'audit_events'
ORDER BY c.relname DESC
LIMIT 3;
"
# Should show the three most recent monthly partitions
```
### Monitor for missing partitions

Add this check to your monitoring system (Prometheus alerting or cron):

```bash
#!/bin/bash
# check_audit_partitions.sh
# Alert if next month's partition does not exist by the 25th

NEXT_MONTH=$(date -d "+1 month" +%Y_%m)
PARTITION="audit_events_${NEXT_MONTH}"

RESULT=$(docker exec bd-postgres-audit psql -U postgres -d auditdb -tAc "
SELECT COUNT(*) FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'audit' AND c.relname = '${PARTITION}';")

if [ "$RESULT" -eq "0" ]; then
    echo "ALERT: Missing audit partition for next month: ${PARTITION}"
    # Send to your alerting system (PagerDuty, Slack, email)
    exit 1
fi
echo "OK: Partition ${PARTITION} exists"
exit 0
```
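
To wire the script into the same crontab as the creation job, schedule it a few days after the 20th so there is time to react before month end. The install path is illustrative:

```shell
# Run the partition check on the 25th of each month at 01:00 UTC,
# five days after the creation job.
0 1 25 * * /opt/bd-fhir-national/scripts/check_audit_partitions.sh \
    >> /var/log/bd-fhir-partition-maintenance.log 2>&1
```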
---

## Monitoring — key metrics to track

These metrics indicate when scaling actions are needed.

### PostgreSQL — fhir

```sql
-- Connection utilisation (should be <80% of max_connections)
SELECT count(*) AS active_connections,
       max_conn,
       round(100.0 * count(*) / max_conn, 1) AS utilisation_pct
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn FROM pg_settings WHERE name = 'max_connections') mc
WHERE state = 'active'
GROUP BY max_conn;

-- Table bloat (trigger VACUUM if dead_pct > 10%)
SELECT relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE relname IN ('hfj_resource', 'hfj_spidx_token', 'hfj_res_ver')
ORDER BY dead_pct DESC;

-- Index usage (investigate, and consider dropping, any non-new index
-- whose idx_scan stays at 0)
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname LIKE 'hfj_%'
ORDER BY idx_scan ASC
LIMIT 10;
```
### PostgreSQL — audit

```sql
-- Partition sizes (plan the next archive when any partition exceeds 10GB)
SELECT
    c.relname AS partition,
    pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit' AND p.relname = 'audit_events'
ORDER BY c.relname DESC;

-- Rejection rate by vendor (flag vendors with >10% rejection rate)
SELECT
    client_id,
    COUNT(*) AS total_events,
    SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) AS rejections,
    ROUND(100.0 * SUM(CASE WHEN outcome = 'REJECTED' THEN 1 ELSE 0 END) / COUNT(*), 1) AS rejection_pct
FROM audit.audit_events
WHERE event_time > NOW() - INTERVAL '7 days'
  AND event_type IN ('OPERATION', 'VALIDATION_FAILURE')
GROUP BY client_id
ORDER BY rejection_pct DESC;
```
### HAPI — Prometheus metrics

Key metrics exposed at `/actuator/prometheus`:

| Metric | Alert threshold |
|--------|-----------------|
| `hikaricp_connections_pending` | >0 for >30s → pool exhaustion |
| `hikaricp_connection_timeout_total` | Any increment → pool exhaustion |
| `http_server_requests_seconds_max` | >30s → OCL timeout or slow validation |
| `jvm_memory_used_bytes / jvm_memory_max_bytes` | >85% → OOM risk, increase container memory |
| `process_uptime_seconds` | Resets → unexpected container restart |
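
Before Prometheus is in place, a threshold can be checked ad hoc by scraping the endpoint directly. A sketch: the parsing helper below is self-contained (it does a simple prefix match so labelled samples like `metric{pool="..."} 3.0` still match); the `curl` URL is illustrative:

```shell
# metric_exceeds: read Prometheus exposition text on stdin; exit 0 if
# the named metric's sample value is above the threshold, else exit 1.
metric_exceeds() {
  local name=$1 threshold=$2
  awk -v m="$name" -v t="$threshold" '
    index($1, m) == 1 { if ($2 + 0 > t + 0) found = 1 }
    END { exit found ? 0 : 1 }'
}

# Typical use (URL is illustrative):
# curl -s http://localhost:8080/actuator/prometheus \
#   | metric_exceeds hikaricp_connections_pending 0 \
#   && echo "ALERT: HikariCP pool exhaustion"
```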
---

## IG upgrade procedure

When BD Core IG advances from v0.2.1 to v0.3.0:

```bash
# 1. On the CI machine: place the new package.tgz in src/main/resources/packages/
cp bd.gov.dghs.core-0.3.0.tgz hapi-overlay/src/main/resources/packages/

# 2. Remove the old package (one IG version per image)
rm hapi-overlay/src/main/resources/packages/bd.gov.dghs.core-0.2.1.tgz

# 3. Update application.yaml / docker-compose env vars:
#    HAPI_IG_PACKAGE_CLASSPATH=classpath:packages/bd.gov.dghs.core-0.3.0.tgz
#    HAPI_IG_VERSION=0.3.0

# 4. Build and push the new image
docker build \
    --build-arg IG_PACKAGE=bd.gov.dghs.core-0.3.0.tgz \
    --build-arg BUILD_VERSION=1.1.0 \
    --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
    -t your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0 \
    -f hapi-overlay/Dockerfile .

docker push your-registry.dghs.gov.bd/bd-fhir-hapi:1.1.0

# 5. Update HAPI_IMAGE in .env on the production server
# 6. Rolling redeploy
docker-compose --env-file .env pull hapi
docker-compose --env-file .env up -d --no-deps hapi

# 7. Verify the new IG version is active
curl -s https://fhir.dghs.gov.bd/fhir/metadata | jq '.software.version'
# Expected: "0.3.0" or the configured HAPI_IG_VERSION value
```

**Vendor notification:** IG upgrades that change SHALL constraints require vendor notification at least 30 days in advance. Vendors must test against the staging environment before production deployment.