Dr. B. M. Riazul Islam c11f0bd5bc first commit

2026-03-16 00:02:58 +06:00

BD FHIR National — Technical Operations Document

System: National FHIR R4 Repository and Validation Engine
Published by: DGHS / MoHFW Bangladesh
IG: BD Core FHIR IG v0.2.1
HAPI FHIR: 7.2.0
Stack: Java 17 · Spring Boot 3.2.5 · PostgreSQL 15 · Docker Compose

1. System Purpose and Architecture
2. Repository Structure
3. How the System Works
4. Infrastructure Components
5. Security Model
6. Validation Pipeline
7. Audit and Forensics
8. CI/CD Pipeline
9. First Deployment — Step by Step
10. Routine Operations
11. ICD-11 Version Upgrade
12. Scaling
13. Troubleshooting
14. Architecture Decisions You Must Not Reverse

1. System Purpose and Architecture

This system is the national FHIR R4 repository for Bangladesh. It serves three purposes simultaneously:

Repository — Stores validated FHIR R4 resources submitted by hospitals, clinics, diagnostic labs, and pharmacies (collectively: vendors). No unvalidated resource enters storage.

Validation engine — Every incoming resource is validated against BD Core FHIR IG profiles AND against the national ICD-11 terminology authority (OCL) before storage. Invalid resources are rejected with HTTP 422 and a FHIR OperationOutcome describing exactly what failed.

HIE gateway — Acts as the national Health Information Exchange boundary. The system enforces that only authenticated, authorised, and clinically valid data enters the national record.

Traffic flow

Vendor system
    │
    │  POST /fhir/Condition
    │  Authorization: Bearer {token}
    ▼
Centralised nginx proxy   ← TLS termination, routing (managed separately)
    │
    ▼
HAPI server :8080
    │
    ├─ KeycloakJwtInterceptor      ← validates JWT, extracts facility identity
    ├─ ClusterExpressionValidator  ← validates ICD-11 cluster expressions
    ├─ RequestValidatingInterceptor ← validates against BD Core IG profiles
    ├─ BdTerminologyValidationSupport ← validates ICD-11 codes against OCL
    │
    ├─ [ACCEPTED] → HFJ_RESOURCE (postgres-fhir)
    │               AuditEventEmitter → audit.audit_events (postgres-audit)
    │
    └─ [REJECTED] → 422 OperationOutcome to vendor
                    RejectedSubmissionSink → audit.fhir_rejected_submissions (postgres-audit)
                    AuditEventEmitter → audit.audit_events (postgres-audit)

External service dependencies

Service	URL	Purpose	Failure behaviour
Keycloak	`https://auth.dghs.gov.bd/realms/hris`	JWT validation, JWKS	Fail closed — all requests rejected
OCL	`https://tr.ocl.dghs.gov.bd/api/fhir`	ICD-11 terminology validation	Fail open — resource accepted with audit record
Cluster validator	`https://icd11.dghs.gov.bd/cluster/validate`	Postcoordinated ICD-11 expressions	Fail open — resource accepted with audit record

Fail-open policy for OCL and cluster validator is deliberate. Service continuity during external service outages takes precedence over perfect validation coverage. Every fail-open event is recorded in the audit log. OCL or cluster validator outages must be treated as high-priority incidents.
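
The failure behaviour in the table above can be sketched as a small decision function. This is an illustrative sketch only: the class, the enums, and the service keys are hypothetical names and are not taken from the repository's code.

```java
import java.util.Map;

// Hypothetical sketch of the failure-behaviour table; not the repository's code.
final class DependencyPolicy {
    enum Policy { FAIL_CLOSED, FAIL_OPEN }
    enum Decision { CONTINUE, REJECT, ACCEPT_WITH_AUDIT_RECORD }

    // Policy per external service, as listed in the table above.
    static final Map<String, Policy> POLICIES = Map.of(
        "keycloak", Policy.FAIL_CLOSED,
        "ocl", Policy.FAIL_OPEN,
        "cluster-validator", Policy.FAIL_OPEN);

    /** Decides how a request proceeds given the state of one dependency. */
    static Decision onDependency(String service, boolean reachable) {
        Policy policy = POLICIES.get(service);
        if (policy == null) {
            throw new IllegalArgumentException("Unknown service: " + service);
        }
        if (reachable) {
            return Decision.CONTINUE; // normal validation path
        }
        return policy == Policy.FAIL_CLOSED
            ? Decision.REJECT                    // Keycloak down: reject everything
            : Decision.ACCEPT_WITH_AUDIT_RECORD; // OCL or cluster validator down
    }
}
```

For example, `DependencyPolicy.onDependency("ocl", false)` yields `ACCEPT_WITH_AUDIT_RECORD`, while the same call for `"keycloak"` yields `REJECT`.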

2. Repository Structure

bd-fhir-national/
├── .env.example                    ← copy to .env, fill secrets
├── docker-compose.yml              ← production orchestration
├── pom.xml                         ← parent Maven POM, version pins
├── hapi-overlay/
│   ├── Dockerfile                  ← multi-stage build
│   ├── pom.xml                     ← runtime dependencies
│   └── src/main/
│       ├── java/bd/gov/dghs/fhir/
│       │   ├── BdFhirApplication.java          ← Spring Boot entry point
│       │   ├── audit/
│       │   │   ├── AuditEventEmitter.java       ← async INSERT to audit_events
│       │   │   └── RejectedSubmissionSink.java  ← async INSERT to rejected_submissions
│       │   ├── config/
│       │   │   ├── DataSourceConfig.java        ← dual datasource, dual Flyway
│       │   │   ├── FhirServerConfig.java        ← validation chain, IG loading
│       │   │   └── SecurityConfig.java          ← interceptor registration
│       │   ├── init/
│       │   │   └── IgPackageInitializer.java    ← advisory lock IG loader
│       │   ├── interceptor/
│       │   │   ├── AuditEventInterceptor.java   ← audit hook
│       │   │   └── KeycloakJwtInterceptor.java  ← JWT auth
│       │   ├── terminology/
│       │   │   ├── BdTerminologyValidationSupport.java  ← OCL integration
│       │   │   └── TerminologyCacheManager.java         ← cache flush endpoint
│       │   └── validator/
│       │       └── ClusterExpressionValidator.java      ← cluster expression check
│       └── resources/
│           ├── application.yaml                ← all Spring/HAPI configuration
│           ├── logback-spring.xml              ← structured JSON logging
│           ├── db/migration/
│           │   ├── fhir/V1__hapi_schema.sql    ← HAPI JPA schema (Flyway)
│           │   └── audit/V2__audit_schema.sql  ← audit schema (Flyway)
│           └── packages/
│               └── .gitkeep                    ← CI places IG .tgz here
├── ops/
│   ├── deployment-guide.md
│   ├── keycloak-setup.md
│   ├── project-manifest.md
│   ├── scaling-roadmap.md
│   └── version-upgrade-integration.md
└── postgres/
    ├── fhir/
    │   ├── init.sql        ← template only — replace with init.sh before deploy
    │   └── postgresql.conf ← PostgreSQL tuning for HAPI workload
    └── audit/
        ├── init.sql        ← template only — replace with init.sh before deploy
        └── postgresql.conf ← PostgreSQL tuning for audit workload

3. How the System Works

Startup sequence

When a HAPI container starts, the following happens in order. If any step fails, the container exits and Docker restarts it.

1. Flyway (FHIR schema) runs V1__hapi_schema.sql against postgres-fhir using the superuser credential. Creates all HAPI JPA tables, sequences, and indexes. Skipped if already applied.
2. Flyway (audit schema) runs V2__audit_schema.sql against postgres-audit. Creates the partitioned audit_events and fhir_rejected_submissions tables, with monthly partitions pre-created through 2027. Skipped if already applied.
3. Hibernate validation checks that the schema exactly matches HAPI's entity mappings (ddl-auto: validate). Fails loudly if tables are missing or wrong.
4. IgPackageInitializer acquires a PostgreSQL advisory lock on postgres-fhir, loads the BD Core IG package from the classpath into HAPI's NpmPackageValidationSupport, writes metadata to NPM_PACKAGE tables, and releases the lock. The advisory lock prevents race conditions when multiple replicas start simultaneously: only one replica writes the metadata row; subsequent replicas find it already present and skip.
5. KeycloakJwtInterceptor fetches the Keycloak JWKS endpoint and caches the signing keys. If Keycloak is unreachable at startup, the interceptor fails to initialise and the container exits.
6. The server begins accepting traffic.

Request lifecycle — accepted resource

1. KeycloakJwtInterceptor
   └─ extracts Bearer token from Authorization header
   └─ verifies signature against cached Keycloak JWKS
   └─ verifies exp, iss = https://auth.dghs.gov.bd/realms/hris
   └─ verifies mci-api role present in realm_access or resource_access
   └─ extracts client_id, sub, sending_facility
   └─ sets request attributes, populates MDC for log correlation

2. AuditEventInterceptor (pre-validation hook)
   └─ invokes ClusterExpressionValidator
      └─ scans Coding elements with system = http://id.who.int/icd/release/11/mms
      └─ if icd11-cluster-expression extension present → calls cluster validator middleware
      └─ if raw postcoordination chars (&, /, %) in code without extension → rejects immediately

3. RequestValidatingInterceptor
   └─ runs FhirInstanceValidator against ValidationSupportChain:
      1. DefaultProfileValidationSupport    (base FHIR R4 profiles)
      2. CommonCodeSystemsTerminologyService (UCUM, MimeType, etc.)
      3. SnapshotGeneratingValidationSupport (differential → snapshot)
      4. InMemoryTerminologyServerValidationSupport (cache layer)
      5. NpmPackageValidationSupport         (BD Core IG profiles)
      6. BdTerminologyValidationSupport      (OCL $validate-code for ICD-11)
   └─ any ERROR severity issue → throws UnprocessableEntityException → 422

4. HAPI JPA persistence
   └─ resource written to HFJ_RESOURCE, HFJ_RES_VER, SPIDX tables

5. AuditEventInterceptor (post-storage hook)
   └─ async: INSERT into audit.audit_events (outcome = ACCEPTED)

6. HTTP 201 Created → vendor

Request lifecycle — rejected resource

1-3. Same as above up to validation failure

4. UnprocessableEntityException thrown with FHIR OperationOutcome

5. AuditEventInterceptor (exception hook)
   └─ async: INSERT full payload into audit.fhir_rejected_submissions
   └─ async: INSERT into audit.audit_events (outcome = REJECTED)

6. HTTP 422 Unprocessable Entity → vendor
   Body: OperationOutcome with issue[].diagnostics and issue[].expression

ICD-11 terminology validation detail

BdTerminologyValidationSupport intercepts every call to validate an ICD-11 coded element:

Cache check — if the code was validated in the last 24 hours, serve result from ConcurrentHashMap. No OCL call.
Cache miss — call OCL $validate-code with system=http://id.who.int/icd/release/11/mms. For Condition.code, include url=https://fhir.dghs.gov.bd/core/ValueSet/bd-condition-icd11-diagnosis-valueset to enforce the Diagnosis + Finding class restriction.
OCL returns result=true — cache as valid, return valid to chain.
OCL returns result=false — cache as invalid, return error to chain → 422.
OCL timeout or 5xx — log WARN, return null (not supported) — fail open.
$expand attempts — isValueSetSupported() returns false for ICD-11 ValueSets. $expand is never attempted. This is intentional: OCL does not support $expand.
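
The cache-then-OCL flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual BdTerminologyValidationSupport: the class name is hypothetical, the OCL call is represented by a plain function that throws on timeout or 5xx, and an empty Optional stands in for HAPI's "not supported" null result.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch only; names and types are assumptions.
final class TerminologyCacheSketch {
    private record Entry(boolean valid, Instant storedAt) {}

    private static final Duration TTL = Duration.ofHours(24);
    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the OCL $validate-code call; throws on timeout or 5xx.
    private final Function<String, Boolean> oclValidateCode;

    TerminologyCacheSketch(Function<String, Boolean> oclValidateCode) {
        this.oclValidateCode = oclValidateCode;
    }

    /** Returns the validation result, or empty when OCL failed (fail open). */
    Optional<Boolean> validate(String code, Instant now) {
        Entry hit = cache.get(code);
        if (hit != null && Duration.between(hit.storedAt(), now).compareTo(TTL) < 0) {
            return Optional.of(hit.valid()); // fresh cache entry, no OCL call
        }
        try {
            boolean valid = oclValidateCode.apply(code);
            cache.put(code, new Entry(valid, now)); // both outcomes are cached
            return Optional.of(valid);
        } catch (RuntimeException e) {
            return Optional.empty(); // OCL unavailable: let the resource through
        }
    }

    /** Evicts everything and reports how many entries were removed. */
    int flush() {
        int evicted = cache.size();
        cache.clear();
        return evicted;
    }
}
```

Passing the clock in as a parameter keeps the TTL behaviour testable without sleeping. Failures are deliberately not cached, so the next submission retries OCL.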

4. Infrastructure Components

Docker services

Service	Image	Purpose	Networks
`hapi`	Private registry	HAPI FHIR application	frontend, backend-fhir, backend-audit
`postgres-fhir`	postgres:15-alpine	FHIR resource store	backend-fhir
`postgres-audit`	postgres:15-alpine	Immutable audit store	backend-audit
`pgbouncer-fhir`	bitnami/pgbouncer:1.22.1	Connection pool → postgres-fhir	backend-fhir
`pgbouncer-audit`	bitnami/pgbouncer:1.22.1	Connection pool → postgres-audit	backend-audit

Network isolation

backend-fhir and backend-audit are marked internal: true — no external internet access from these networks. The database containers cannot reach external services and external services cannot reach the databases directly.

pgBouncer configuration

Both pgBouncer instances run in session mode. This is mandatory. HAPI uses Hibernate which relies on prepared statements — transaction mode pgBouncer breaks these. Do not change the pool mode.

Pool sizing at pilot phase (1 HAPI replica):

Pool	HikariCP max per replica	pgBouncer pool_size	PostgreSQL max_connections
FHIR	5	20	30
Audit	2	10	20

At 3 replicas: 15 FHIR connections, 6 audit connections — both within pool limits.
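
The arithmetic behind the sizing table is simply replicas multiplied by the HikariCP maximum, compared against the pgBouncer pool size. A small sketch (the class and method names are illustrative, not part of the project):

```java
// Illustrative pool-sizing arithmetic; names are hypothetical.
final class PoolSizing {
    /** Worst-case client connections opened by all HAPI replicas together. */
    static int requiredConnections(int replicas, int hikariMaxPerReplica) {
        return replicas * hikariMaxPerReplica;
    }

    /** True when the replicas fit inside the pgBouncer pool. */
    static boolean fits(int replicas, int hikariMaxPerReplica, int pgBouncerPoolSize) {
        return requiredConnections(replicas, hikariMaxPerReplica) <= pgBouncerPoolSize;
    }

    /** Largest replica count the pool can serve at the given HikariCP maximum. */
    static int maxReplicas(int hikariMaxPerReplica, int pgBouncerPoolSize) {
        return pgBouncerPoolSize / hikariMaxPerReplica;
    }
}
```

With the values from the table, `requiredConnections(3, 5)` is 15 for the FHIR pool and `requiredConnections(3, 2)` is 6 for the audit pool, and `maxReplicas(5, 20)` is 4.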

Databases

postgres-fhir contains all HAPI JPA tables. Schema managed by Flyway V1__hapi_schema.sql. ddl-auto: validate means Hibernate never modifies the schema — Flyway owns all DDL. If a HAPI upgrade requires schema changes, write a new Flyway migration.

postgres-audit contains the audit schema only. Two tables, both partitioned by month. Schema managed by Flyway V2__audit_schema.sql against postgres-audit (separate Flyway instance, separate history table flyway_audit_schema_history).

Volumes

Volume	Contents	Backup priority
`postgres-fhir-data`	All FHIR resources	Critical — primary data
`postgres-audit-data`	All audit records, rejected payloads	Critical — forensic/legal
`hapi-logs`	Structured JSON application logs	Medium — operational

5. Security Model

Authentication

Every request to FHIR endpoints (except GET /fhir/metadata and /actuator/health/**) requires a valid Bearer token issued by Keycloak realm hris.

KeycloakJwtInterceptor performs these checks in order, rejecting with HTTP 401 on any failure:

1. Authorization: Bearer header present and non-empty
2. JWT signature valid against Keycloak JWKS (RS256 only; symmetric algorithms rejected)
3. exp claim in the future (not expired)
4. iss claim exactly equals https://auth.dghs.gov.bd/realms/hris
5. mci-api role present in realm_access.roles OR in resource_access.{client-id}.roles
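
The role check in the last item can be sketched against already-parsed JWT claims. This is an illustration, not the KeycloakJwtInterceptor source: it assumes the claims have been decoded into nested maps and lists, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Illustrative role lookup over decoded JWT claims; names are assumptions.
final class RoleCheck {
    /** Extracts the "roles" array from a realm_access or client entry, if any. */
    private static List<String> roles(Object container) {
        if (container instanceof Map<?, ?> m && m.get("roles") instanceof List<?> l) {
            return l.stream().filter(String.class::isInstance).map(String.class::cast).toList();
        }
        return List.of();
    }

    /** True when the role is in realm_access.roles or resource_access.{clientId}.roles. */
    static boolean hasRole(Map<String, Object> claims, String clientId, String role) {
        if (roles(claims.get("realm_access")).contains(role)) {
            return true;
        }
        if (claims.get("resource_access") instanceof Map<?, ?> resourceAccess) {
            return roles(resourceAccess.get(clientId)).contains(role);
        }
        return false;
    }
}
```

Missing or malformed claim structures simply produce an empty role list, so the check fails closed.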

The JWKS is cached locally with a 1-hour TTL. On receiving a JWT with an unknown kid, the JWKS is immediately re-fetched regardless of TTL — this handles Keycloak key rotation without delay.
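
A minimal sketch of that caching rule, assuming the JWKS fetch is abstracted as a supplier that returns a map from kid to key material. The class name and the string-valued keys are placeholders for illustration; the real interceptor will hold parsed public keys.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative JWKS cache; not the repository's implementation.
final class JwksCacheSketch {
    private static final Duration TTL = Duration.ofHours(1);
    // Hypothetical stand-in for the HTTP call to the Keycloak certs endpoint.
    private final Supplier<Map<String, String>> fetchJwks;
    private Map<String, String> keysByKid = Map.of();
    private Instant fetchedAt; // null until the first fetch

    JwksCacheSketch(Supplier<Map<String, String>> fetchJwks) {
        this.fetchJwks = fetchJwks;
    }

    /** Returns the key for a kid, re-fetching on TTL expiry or on an unknown kid. */
    synchronized Optional<String> keyFor(String kid, Instant now) {
        Objects.requireNonNull(kid, "kid");
        boolean expired = fetchedAt == null
            || Duration.between(fetchedAt, now).compareTo(TTL) >= 0;
        if (expired || !keysByKid.containsKey(kid)) {
            keysByKid = Map.copyOf(fetchJwks.get());
            fetchedAt = now;
        }
        return Optional.ofNullable(keysByKid.get(kid));
    }
}
```

A kid that is still unknown after the re-fetch yields an empty result, which maps to a 401. A production version would likely also rate-limit re-fetches so that tokens with random kid values cannot trigger a request to Keycloak on every call; that detail is omitted here.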

Authorisation

Vendors — must have mci-api role. Client naming convention: fhir-vendor-{organisation-id}.

Admin operations (cache flush endpoint) — must have fhir-admin role. Only the fhir-admin-pipeline service account and DGHS system administrators hold this role.

Keycloak client setup for new vendors

See ops/keycloak-setup.md for the full procedure. Summary:

1. Create client fhir-vendor-{org-id} in the hris realm: confidential, service accounts enabled, standard flow off.
2. Assign the mci-api role to the service account.
3. Add a sending_facility user attribute with the DGHS facility code.
4. Add a User Attribute token mapper for sending_facility → token claim sending_facility.
5. Deliver client_id and client_secret to the vendor.

If a vendor token is missing the sending_facility claim, HAPI logs WARN on every submission and uses client_id as the facility identifier in audit records. This is a data quality issue — configure the mapper.
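
The fallback described above amounts to the following rule. The class and method names are hypothetical and only illustrate the behaviour; the WARN logging is left out.

```java
// Illustrative fallback for the facility identifier; names are assumptions.
final class FacilityResolver {
    /** Uses the sending_facility claim when present, otherwise falls back to client_id. */
    static String resolve(String sendingFacilityClaim, String clientId) {
        boolean missing = sendingFacilityClaim == null || sendingFacilityClaim.isBlank();
        return missing ? clientId : sendingFacilityClaim;
    }
}
```

Audit rows whose facility identifier starts with fhir-vendor- are therefore a sign that the token mapper is not configured for that client.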

Vendor token flow

# Vendor obtains token
POST https://auth.dghs.gov.bd/realms/hris/protocol/openid-connect/token
  grant_type=client_credentials
  client_id=fhir-vendor-{org-id}
  client_secret={secret}
→ { "access_token": "eyJ...", "expires_in": 300 }

# Vendor submits resource
POST https://fhir.dghs.gov.bd/fhir/Condition
  Authorization: Bearer eyJ...
  Content-Type: application/fhir+json
  { ... }

Tokens expire in 5 minutes (Keycloak default). Vendor systems must refresh before expiry.
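
On the vendor side, "refresh before expiry" can be implemented as a simple time check before each submission. The sketch below is one possible approach, not a required client design; the class name and the safety margin are assumptions.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative vendor-side check; the margin value is the caller's choice.
final class TokenRefreshSketch {
    /**
     * True when a new token should be requested: the current one has expired
     * or will expire within the given safety margin.
     */
    static boolean shouldRefresh(Instant issuedAt, long expiresInSeconds,
                                 Instant now, Duration margin) {
        Instant expiry = issuedAt.plusSeconds(expiresInSeconds);
        return !now.isBefore(expiry.minus(margin));
    }
}
```

With expires_in = 300 and a 30-second margin, the client would request a new token from 270 seconds after issuance onwards.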

6. Validation Pipeline

BD Core IG profiles

The following resource types are validated against BD Core IG profiles:

Resource type	Profile URL
Patient	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-patient`
Condition	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-condition`
Encounter	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-encounter`
Observation	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-observation`
Practitioner	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-practitioner`
Organization	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-organization`
Location	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-location`
Medication	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-medication`
MedicationRequest	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-medicationrequest`
Immunization	`https://fhir.dghs.gov.bd/core/StructureDefinition/bd-immunization`

Resources of any other type have no BD Core IG profile. They are not rejected; they are stored with meta.tag = https://fhir.dghs.gov.bd/tags|unvalidated-profile and can be queried with _tag=https://fhir.dghs.gov.bd/tags|unvalidated-profile.
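
The mapping from resource type to profile URL follows a regular pattern, which the sketch below captures. It is illustrative only: the class is hypothetical, and the real server resolves profiles through the loaded IG package rather than through string construction.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Illustrative lookup derived from the profile table; not the server's mechanism.
final class ProfileLookup {
    static final String BASE = "https://fhir.dghs.gov.bd/core/StructureDefinition/";
    static final String UNVALIDATED_TAG = "https://fhir.dghs.gov.bd/tags|unvalidated-profile";

    static final Set<String> PROFILED = Set.of(
        "Patient", "Condition", "Encounter", "Observation", "Practitioner",
        "Organization", "Location", "Medication", "MedicationRequest", "Immunization");

    /** Profile URL for a profiled resource type, or empty when the type is tagged instead. */
    static Optional<String> profileFor(String resourceType) {
        return PROFILED.contains(resourceType)
            ? Optional.of(BASE + "bd-" + resourceType.toLowerCase(Locale.ROOT))
            : Optional.empty();
    }
}
```

An empty result corresponds to the case where the resource is stored with the `UNVALIDATED_TAG` value in meta.tag.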

ICD-11 cluster expression format

BD Core IG defines a specific pattern for postcoordinated ICD-11 expressions. Raw postcoordinated strings in Coding.code are prohibited.

Correct format:

"code": {
  "coding": [{
    "system": "http://id.who.int/icd/release/11/mms",
    "code": "1C62.0",
    "extension": [{
      "url": "icd11-cluster-expression",
      "valueString": "1C62.0/http%3A%2F%2Fid.who.int%2F..."
    }]
  }]
}

Prohibited format (rejected with 422):

"code": {
  "coding": [{
    "system": "http://id.who.int/icd/release/11/mms",
    "code": "1C62.0&has_severity=mild"
  }]
}
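
The distinction between the two formats can be sketched as a classification of each Coding. This is a simplified illustration of the rule, not the ClusterExpressionValidator source; the class, the enum, and the flattened parameters (system, code, extension value) are assumptions.

```java
// Illustrative classification of an ICD-11 Coding; names are hypothetical.
final class ClusterExpressionCheck {
    static final String ICD11_MMS = "http://id.who.int/icd/release/11/mms";

    enum Result { NOT_ICD11, PLAIN_STEM_CODE, NEEDS_CLUSTER_VALIDATION, REJECT_MISSING_EXTENSION }

    /** True when the code contains a raw postcoordination character (&, / or %). */
    static boolean hasRawPostcoordination(String code) {
        return code.indexOf('&') >= 0 || code.indexOf('/') >= 0 || code.indexOf('%') >= 0;
    }

    /**
     * clusterExpression is the valueString of the icd11-cluster-expression
     * extension, or null when the extension is absent.
     */
    static Result classify(String system, String code, String clusterExpression) {
        if (!ICD11_MMS.equals(system)) {
            return Result.NOT_ICD11; // other code systems are not inspected here
        }
        if (clusterExpression != null) {
            return Result.NEEDS_CLUSTER_VALIDATION; // send to the cluster validator
        }
        if (hasRawPostcoordination(code)) {
            return Result.REJECT_MISSING_EXTENSION; // CLUSTER_STEM_MISSING_EXTENSION
        }
        return Result.PLAIN_STEM_CODE; // ordinary OCL validation only
    }
}
```

Applied to the two examples above, the first Coding classifies as `NEEDS_CLUSTER_VALIDATION` and the second as `REJECT_MISSING_EXTENSION`.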

Rejection codes

The rejection_code column in audit.fhir_rejected_submissions contains one of:

Code	Meaning
`PROFILE_VIOLATION`	Resource violates a BD Core IG SHALL constraint
`TERMINOLOGY_INVALID_CODE`	ICD-11 code not found in OCL
`TERMINOLOGY_INVALID_CLASS`	ICD-11 code exists but is not Diagnosis/Finding class
`CLUSTER_EXPRESSION_INVALID`	Cluster expression failed cluster validator
`CLUSTER_STEM_MISSING_EXTENSION`	Raw postcoordinated string without extension
`AUTH_TOKEN_MISSING`	No Bearer token
`AUTH_TOKEN_EXPIRED`	Token `exp` in the past
`AUTH_TOKEN_INVALID_SIGNATURE`	Signature verification failed
`AUTH_TOKEN_MISSING_ROLE`	`mci-api` role absent
`AUTH_TOKEN_INVALID_ISSUER`	`iss` does not match Keycloak realm
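
For consumers of the audit tables, the codes can be modelled as an enum that also records the HTTP status each one implies (401 for authentication failures, 422 for validation failures, as described in Sections 1 and 5). The enum below is an illustrative model for reporting tools, not a class from the repository.

```java
// Illustrative model of rejection_code values; not taken from the codebase.
enum RejectionCode {
    PROFILE_VIOLATION(422),
    TERMINOLOGY_INVALID_CODE(422),
    TERMINOLOGY_INVALID_CLASS(422),
    CLUSTER_EXPRESSION_INVALID(422),
    CLUSTER_STEM_MISSING_EXTENSION(422),
    AUTH_TOKEN_MISSING(401),
    AUTH_TOKEN_EXPIRED(401),
    AUTH_TOKEN_INVALID_SIGNATURE(401),
    AUTH_TOKEN_MISSING_ROLE(401),
    AUTH_TOKEN_INVALID_ISSUER(401);

    /** HTTP status returned to the vendor for this kind of rejection. */
    final int httpStatus;

    RejectionCode(int httpStatus) {
        this.httpStatus = httpStatus;
    }

    /** True for rejections raised by the JWT checks rather than by validation. */
    boolean isAuthFailure() {
        return httpStatus == 401;
    }
}
```

`RejectionCode.valueOf(row.rejectionCode)` then gives a typed value when reading rows from audit.fhir_rejected_submissions, where `row` is whatever record type the reporting tool uses.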

7. Audit and Forensics

Two audit stores

audit.audit_events — one row per request outcome. Always written, accepted and rejected. Contains: event_type, operation, resource_type, resource_id, outcome, outcome_detail, sending_facility, client_id, subject, request_ip, request_id, validation_messages (JSONB).

audit.fhir_rejected_submissions — one row per rejected write. Contains: full resource payload as submitted (TEXT, not JSONB), rejection_code, rejection_reason, element_path, violated_profile, invalid_code, invalid_system.

Immutability

The audit_writer_login PostgreSQL user has INSERT only on the audit schema. The HAPI JVM connects to postgres-audit as this user. No UPDATE or DELETE is possible from the application layer regardless of what the application code attempts. Only a PostgreSQL superuser can modify audit records.

Partitioning

Both audit tables are partitioned by month (PARTITION BY RANGE (event_time)). Monthly partitions are pre-created through December 2027. A cron job must create next-month partitions on the 20th of each month. If this lapses, INSERT fails with a hard error.
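
To make the monthly window concrete, the sketch below computes the range that a next-month partition has to cover. The partition naming scheme (`table_YYYY_MM`) is an assumption made for illustration; the actual names are defined by audit.create_next_month_partitions() in V2__audit_schema.sql.

```java
import java.time.LocalDate;
import java.time.YearMonth;

// Illustrative computation of the next monthly partition window.
final class PartitionWindow {
    record Range(String name, LocalDate fromInclusive, LocalDate toExclusive) {}

    /** Range covering the calendar month after the one containing `today`. */
    static Range nextMonth(String table, LocalDate today) {
        YearMonth next = YearMonth.from(today).plusMonths(1);
        // Assumed naming scheme: <table>_<year>_<month>
        String name = String.format("%s_%04d_%02d", table, next.getYear(), next.getMonthValue());
        return new Range(name, next.atDay(1), next.plusMonths(1).atDay(1));
    }
}
```

Run on 20 December 2027, `nextMonth("audit_events", ...)` yields the range 2028-01-01 (inclusive) to 2028-02-01 (exclusive), which is the first partition not covered by the pre-created set.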

Set up the cron job immediately after first deployment:

# On the host running postgres-audit
crontab -e
# Add the following as a single line (crontab entries cannot be continued with a backslash):
0 0 20 * * docker exec bd-postgres-audit psql -U audit_maintainer_login -d auditdb -c "SELECT audit.create_next_month_partitions();" >> /var/log/bd-fhir-partition-maintenance.log 2>&1

Useful audit queries

-- Rejection rate by vendor, last 7 days
SELECT client_id,
       COUNT(*) AS total,
       SUM(CASE WHEN outcome='REJECTED' THEN 1 ELSE 0 END) AS rejected,
       ROUND(100.0 * SUM(CASE WHEN outcome='REJECTED' THEN 1 ELSE 0 END) / COUNT(*), 1) AS pct
FROM audit.audit_events
WHERE event_time > NOW() - INTERVAL '7 days'
  AND event_type IN ('OPERATION','VALIDATION_FAILURE')
GROUP BY client_id ORDER BY pct DESC;

-- Retrieve rejected payloads for a vendor
SELECT submission_time, resource_type, rejection_code, rejection_reason, element_path
FROM audit.fhir_rejected_submissions
WHERE client_id = 'fhir-vendor-{org-id}'
ORDER BY submission_time DESC LIMIT 20;

-- Auth failures
SELECT event_time, client_id, outcome_detail, request_ip
FROM audit.audit_events
WHERE event_type = 'AUTH_FAILURE'
ORDER BY event_time DESC LIMIT 20;

8. CI/CD Pipeline

The production server never builds. It only pulls pre-built images from the private registry.

CI pipeline steps (on CI machine)

# 1. Obtain BD Core IG package and place it
cp /path/to/bd.gov.dghs.core-0.2.1.tgz \
   hapi-overlay/src/main/resources/packages/

# 2. Run tests (TestContainers spins up real PostgreSQL — no H2)
mvn test -pl hapi-overlay -am

# 3. Build Docker image (multi-stage: Maven builder + JRE runtime)
docker build \
  --build-arg IG_PACKAGE=bd.gov.dghs.core-0.2.1.tgz \
  --build-arg BUILD_VERSION=1.0.0 \
  --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
  -t your-registry.dghs.gov.bd/bd-fhir-hapi:1.0.0 \
  -f hapi-overlay/Dockerfile \
  .

# 4. Push to private registry
docker push your-registry.dghs.gov.bd/bd-fhir-hapi:1.0.0

The packages/ directory must contain exactly one .tgz file matching HAPI_IG_PACKAGE_CLASSPATH in .env. If the directory is empty or the filename does not match, the container fails startup immediately with a clear error message.
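
The startup rule above can be expressed as a small check over the file names found in packages/. This is a sketch under assumptions: the class is hypothetical, the error messages are illustrative, and the expected name is taken to be the bare file name rather than the full HAPI_IG_PACKAGE_CLASSPATH value.

```java
import java.util.List;

// Illustrative startup check; names and messages are assumptions.
final class IgPackageCheck {
    /** Returns the single .tgz file name, or throws when the directory is not valid. */
    static String requireSinglePackage(List<String> fileNames, String expectedFileName) {
        List<String> packages = fileNames.stream()
            .filter(name -> name.endsWith(".tgz"))
            .toList();
        if (packages.size() != 1) {
            throw new IllegalStateException(
                "Expected exactly one .tgz in packages/, found " + packages.size());
        }
        String found = packages.get(0);
        if (!found.equals(expectedFileName)) {
            throw new IllegalStateException(
                "Package " + found + " does not match configured " + expectedFileName);
        }
        return found;
    }
}
```

Non-package files such as .gitkeep are ignored, so a directory containing only .gitkeep counts as empty and fails the check.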

9. First Deployment — Step by Step

Prerequisites

Ubuntu 22.04 LTS, minimum 8GB RAM, 4 vCPU, 100GB disk
Outbound HTTPS to Keycloak, OCL, cluster validator, private registry
Docker image already built and pushed (see Section 8)
Keycloak configured (see ops/keycloak-setup.md)

Step 1 — Install Docker

sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
# log out and back in

Step 2 — Prepare application directory

sudo mkdir -p /opt/bd-fhir-national
sudo chown $USER:$USER /opt/bd-fhir-national
# rsync project files from CI/deployment machine (excluding source tree)
rsync -avz --exclude='.git' --exclude='hapi-overlay/target' \
  --exclude='hapi-overlay/src' \
  ./bd-fhir-national/ deploy@server:/opt/bd-fhir-national/

Step 3 — Create .env

cd /opt/bd-fhir-national
cp .env.example .env
chmod 600 .env
nano .env   # fill all <CHANGE_ME> values
# verify: grep CHANGE_ME .env should return nothing

Step 4 — Fix init scripts (CRITICAL — do not skip)

The postgres/fhir/init.sql and postgres/audit/init.sql files are templates with placeholder passwords. The official PostgreSQL Docker image runs .sql init files verbatim and does not substitute environment variables in them. Replace them with .sh scripts that read from environment variables.

# FHIR database init script
cat > /opt/bd-fhir-national/postgres/fhir/init.sh <<'EOF'
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    DO \$\$ BEGIN
        IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = '${FHIR_DB_APP_USER}') THEN
            CREATE USER ${FHIR_DB_APP_USER} WITH NOSUPERUSER NOCREATEDB NOCREATEROLE
                NOINHERIT LOGIN CONNECTION LIMIT 30 PASSWORD '${FHIR_DB_APP_PASSWORD}';
        END IF;
    END \$\$;
    GRANT CONNECT ON DATABASE ${POSTGRES_DB} TO ${FHIR_DB_APP_USER};
    GRANT USAGE ON SCHEMA public TO ${FHIR_DB_APP_USER};
    ALTER DEFAULT PRIVILEGES IN SCHEMA public
        GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO ${FHIR_DB_APP_USER};
    ALTER DEFAULT PRIVILEGES IN SCHEMA public
        GRANT USAGE, SELECT ON SEQUENCES TO ${FHIR_DB_APP_USER};
EOSQL
EOF
chmod +x /opt/bd-fhir-national/postgres/fhir/init.sh

# Audit database init script
cat > /opt/bd-fhir-national/postgres/audit/init.sh <<'EOF'
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    DO \$\$ BEGIN
        IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = '${AUDIT_DB_WRITER_USER}') THEN
            CREATE USER ${AUDIT_DB_WRITER_USER} WITH NOSUPERUSER NOCREATEDB NOCREATEROLE
                NOINHERIT LOGIN CONNECTION LIMIT 20 PASSWORD '${AUDIT_DB_WRITER_PASSWORD}';
        END IF;
        IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = '${AUDIT_DB_MAINTAINER_USER}') THEN
            CREATE USER ${AUDIT_DB_MAINTAINER_USER} WITH NOSUPERUSER NOCREATEDB NOCREATEROLE
                NOINHERIT LOGIN CONNECTION LIMIT 5 PASSWORD '${AUDIT_DB_MAINTAINER_PASSWORD}';
        END IF;
    END \$\$;
    GRANT CONNECT ON DATABASE ${POSTGRES_DB} TO ${AUDIT_DB_WRITER_USER};
    GRANT CONNECT ON DATABASE ${POSTGRES_DB} TO ${AUDIT_DB_MAINTAINER_USER};
EOSQL
EOF
chmod +x /opt/bd-fhir-national/postgres/audit/init.sh

Update docker-compose.yml: in both postgres services, change the init volume mount from .sql to .sh, and pass the environment variables each init script reads:

# postgres-fhir volumes: change
- ./postgres/fhir/init.sh:/docker-entrypoint-initdb.d/init.sh:ro
# add to postgres-fhir environment:
FHIR_DB_APP_USER:     ${FHIR_DB_APP_USER}
FHIR_DB_APP_PASSWORD: ${FHIR_DB_APP_PASSWORD}

# postgres-audit volumes: change
- ./postgres/audit/init.sh:/docker-entrypoint-initdb.d/init.sh:ro
# add to postgres-audit environment:
AUDIT_DB_WRITER_USER:         ${AUDIT_DB_WRITER_USER}
AUDIT_DB_WRITER_PASSWORD:     ${AUDIT_DB_WRITER_PASSWORD}
AUDIT_DB_MAINTAINER_USER:     ${AUDIT_DB_MAINTAINER_USER}
AUDIT_DB_MAINTAINER_PASSWORD: ${AUDIT_DB_MAINTAINER_PASSWORD}

Step 5 — Pull images

docker login your-registry.dghs.gov.bd
docker compose --env-file .env pull

Step 6 — Start databases

docker compose --env-file .env up -d postgres-fhir postgres-audit
# wait for healthy
# match "(healthy)" exactly; a bare "healthy" would also match "unhealthy"
until docker compose --env-file .env ps postgres-fhir | grep -q "(healthy)"; do sleep 3; done
until docker compose --env-file .env ps postgres-audit | grep -q "(healthy)"; do sleep 3; done

Step 7 — Verify database users

docker exec bd-postgres-fhir psql -U postgres -d fhirdb \
  -c "SELECT rolname FROM pg_roles WHERE rolname='hapi_app';"
# Expected: hapi_app

docker exec bd-postgres-audit psql -U postgres -d auditdb \
  -c "SELECT rolname FROM pg_roles WHERE rolname IN ('audit_writer_login','audit_maintainer_login');"
# Expected: two rows

Step 8 — Start pgBouncer and HAPI

docker compose --env-file .env up -d pgbouncer-fhir pgbouncer-audit
docker compose --env-file .env up -d hapi

# Follow startup — takes 60-120 seconds
docker compose --env-file .env logs -f hapi

Expected log sequence:

Running FHIR Flyway migrations...   → V1 applied
Running Audit Flyway migrations...  → V2 applied
Advisory lock acquired...           → IG loading begins
BD Core IG package loaded...        → IG ready
BdTerminologyValidationSupport initialised...
KeycloakJwtInterceptor initialised...
HAPI RestfulServer interceptors registered...
Tomcat started on port(s): 8080
Started BdFhirApplication in XX seconds

Step 9 — Verify health

# Internal (direct to HAPI)
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s http://localhost:8080/actuator/health | jq .
# All components must show status: UP

# FHIR metadata
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s http://localhost:8080/fhir/metadata | jq '.software'
# Expected: { "name": "BD FHIR National Repository", "version": "0.2.1" }

Step 10 — Set up partition maintenance cron

crontab -e
# Add the following as a single line (crontab entries cannot be continued with a backslash):
0 0 20 * * docker exec bd-postgres-audit psql -U audit_maintainer_login -d auditdb -c "SELECT audit.create_next_month_partitions();" >> /var/log/bd-fhir-partition-maintenance.log 2>&1

Step 11 — Run acceptance tests

Run all tests from Section 9.3 of ops/deployment-guide.md. All nine must pass before the system is declared production-ready.

10. Routine Operations

View logs

# All services
docker compose --env-file .env logs -f

# HAPI logs as structured JSON
# (--no-log-prefix drops the service-name prefix so each line parses as JSON)
docker compose --env-file .env logs -f --no-log-prefix hapi | jq -R 'try fromjson'

# Filter for rejections
docker compose --env-file .env logs --no-log-prefix hapi | \
  jq -R 'try fromjson | select(.message | test("rejected|REJECTED"))'

Deploy a new image version

# Update image tag in .env
nano /opt/bd-fhir-national/.env
# Change HAPI_IMAGE to new tag

# Pull and redeploy
docker compose --env-file .env pull hapi
docker compose --env-file .env up -d --no-deps hapi

# Verify startup
docker compose --env-file .env logs -f hapi

Scale HAPI replicas

docker compose --env-file .env up -d --scale hapi=3
# No other configuration changes needed at 3 replicas.
# pgBouncer pool_size=20 supports up to 4 replicas at HikariCP max=5.
# At 5+ replicas: increase PGBOUNCER_DEFAULT_POOL_SIZE and postgres max_connections first.

Restart a service

docker compose --env-file .env restart hapi
docker compose --env-file .env restart postgres-fhir   # causes brief HAPI downtime

Full stack restart

docker compose --env-file .env down
docker compose --env-file .env up -d

Check pgBouncer pool status

docker exec bd-pgbouncer-fhir psql -h localhost -p 5432 -U pgbouncer pgbouncer \
  -c "SHOW POOLS;"

11. ICD-11 Version Upgrade

When a new ICD-11 MMS release is imported into OCL, the HAPI terminology cache becomes stale. The upgrade pipeline must flush the cache after OCL import. Full procedure in ops/version-upgrade-integration.md. Summary:

Order is mandatory:

1. OCL: import new ICD-11 concepts
2. OCL: patch concept_class for Diagnosis + Finding
3. OCL: repopulate bd-condition-icd11-diagnosis-valueset
4. OCL: verify $validate-code returns correct results for new codes
5. HAPI: flush terminology cache
6. HAPI: verify new codes validate correctly

Step 5 — cache flush:

# Get fhir-admin token
ADMIN_TOKEN=$(curl -s -X POST \
  "https://auth.dghs.gov.bd/realms/hris/protocol/openid-connect/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=fhir-admin-pipeline" \
  -d "client_secret=${FHIR_ADMIN_CLIENT_SECRET}" \
  | jq -r '.access_token')

# Flush — run from inside Docker network (admin endpoint is network-restricted)
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s -X DELETE \
    -H "Authorization: Bearer ${ADMIN_TOKEN}" \
    http://localhost:8080/admin/terminology/cache | jq .
# Expected: { "status": "flushed", "entriesEvicted": N }

IG version upgrade (when BD Core IG advances to a new version):

1. Place the new .tgz in hapi-overlay/src/main/resources/packages/ and remove the old one.
2. Update HAPI_IG_PACKAGE_CLASSPATH and HAPI_IG_VERSION in .env.
3. Build and push a new Docker image on the CI machine.
4. Deploy the new image on the production server.

12. Scaling

Current capacity (Phase 1 — Pilot)

Metric	Capacity
HAPI replicas	1
Vendors	<50
Resources/day	<10,000
PostgreSQL connections (FHIR)	5
PostgreSQL connections (Audit)	2

Scaling to Phase 2 (Regional — up to 500 vendors, 100,000 resources/day)

# Scale HAPI to 3 replicas — no other changes required
docker compose --env-file .env up -d --scale hapi=3

Beyond 3 replicas, update pgBouncer pool sizes and PostgreSQL max_connections before scaling. See ops/scaling-roadmap.md for the full capacity matrix and Phase 3 (national scale → Kubernetes) guidance.

13. Troubleshooting

Container not starting

docker compose --env-file .env logs hapi | tail -50

Log message	Cause	Fix
`STARTUP FAILURE: BD Core IG package not found`	`.tgz` missing from image	Rebuild image with package in `packages/`
`FHIR Flyway configuration missing`	`SPRING_FLYWAY_*` env vars not set	Check `.env`
`password authentication failed for user "hapi_app"`	`init.sh` not run or wrong password	Verify Step 4 of deployment, check `.env` passwords
`Advisory lock acquisition timed out`	Another replica acquired the lock and crashed mid-init	Check `pg_locks` on postgres-fhir and terminate the session holding the stale lock
`Connection refused` to Keycloak JWKS	Keycloak unreachable at startup	Check network connectivity, Keycloak health
`Schema-validation: missing table`	Flyway did not run	Check `SPRING_FLYWAY_*` env vars, check flyway_schema_history table

401 on all authenticated requests

# Check JWKS endpoint is reachable from inside the container
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s https://auth.dghs.gov.bd/realms/hris/protocol/openid-connect/certs | jq '.keys | length'
# Expected: 1 or more keys

If JWKS is unreachable, all requests will be rejected with 401 (fail closed). Check firewall rules — the HAPI container must have outbound HTTPS to Keycloak.

422 on all ICD-11 coded submissions

# Check OCL is reachable
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s -o /dev/null -w "%{http_code}" \
  "https://tr.ocl.dghs.gov.bd/api/fhir/metadata"
# Expected: 200

# Check a specific code manually
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s "https://tr.ocl.dghs.gov.bd/api/fhir/ValueSet/\$validate-code?\
url=https://fhir.dghs.gov.bd/core/ValueSet/bd-condition-icd11-diagnosis-valueset\
&system=http://id.who.int/icd/release/11/mms&code=1C62.0" | jq .

If OCL is unreachable, the system fails open and codes are accepted, so an OCL outage alone does not explain blanket 422 responses. If codes are being rejected while OCL is reachable, inspect OCL's $validate-code response directly using the command above.

Audit writes failing

# Check HAPI logs for "AUDIT WRITE FAILED"
docker compose --env-file .env logs hapi | grep "AUDIT WRITE FAILED"

# Check audit datasource health
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s http://localhost:8080/actuator/health | jq '.components.auditDb'

Partition missing (INSERT to audit failing)

# Check which partitions exist
docker exec bd-postgres-audit psql -U postgres -d auditdb -c "
SELECT c.relname FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_namespace n ON n.oid = p.relnamespace
WHERE n.nspname = 'audit' AND p.relname = 'audit_events'
ORDER BY c.relname DESC LIMIT 3;"

# Create missing partition manually
docker exec bd-postgres-audit psql -U postgres -d auditdb \
  -c "SELECT audit.create_next_month_partitions();"

Check disk usage

docker system df -v
df -h /var/lib/docker

14. Architecture Decisions You Must Not Reverse

These decisions are load-bearing. Reversing any of them without fully understanding the consequences will break the system.

PostgreSQL only — no H2, not even for tests.
The test suite uses TestContainers to spin up real PostgreSQL 15. H2 is not on the classpath. Using H2 masks database-specific behaviour (advisory locks, partitioning, JSONB) and produces false-green test results.

Validation on ALL requests — no vendor exemptions.
The RequestValidatingInterceptor runs on every write. There is no per-vendor or per-resource-type bypass. This is the HIE boundary enforcement. A bypass for one vendor breaks the national data quality guarantee for everyone downstream.

OCL is the single terminology authority.
There is no local ICD-11 concept store. All ICD-11 validation goes to OCL. This means OCL availability affects HAPI validation quality. Keep OCL healthy. Do not add a local fallback without understanding the implications for version consistency.

$expand is never attempted for ICD-11 ValueSets.
OCL does not support $expand. The isValueSetSupported() override returns false for all ICD-11 ValueSets. Do not remove this — removing it causes HAPI to attempt $expand, receive an empty response, and reject every ICD-11 coded resource regardless of whether the code is valid.

pgBouncer must remain in session mode.
Hibernate uses prepared statements and advisory locks. Transaction mode pgBouncer breaks both. Do not change PGBOUNCER_POOL_MODE to transaction.

Flyway owns all DDL — Hibernate never modifies schema.
ddl-auto: validate means Hibernate will refuse to start if the schema does not match its entities, but it will never ALTER or CREATE tables. If a HAPI upgrade changes entity mappings, write a Flyway migration. Never change ddl-auto to update in production.

Audit writes are append-only.
The audit_writer_login PostgreSQL user has INSERT only. The application cannot UPDATE or DELETE audit records regardless of what the code does. This is enforced at the database level. Do not grant additional privileges to this user.

The IG package is bundled in the Docker image.
The .tgz is a build-time artifact, not a runtime configuration. There is no hot-reload. An IG upgrade requires a new Docker image build and deployment. This is by design — it ties IG version to container version, making deployments auditable and rollbacks clean.
