# BD FHIR National — Technical Operations Document

**System:** National FHIR R4 Repository and Validation Engine
**Published by:** DGHS / MoHFW Bangladesh
**IG:** BD Core FHIR IG v0.2.1
**HAPI FHIR:** 7.2.0
**Stack:** Java 17 · Spring Boot 3.2.5 · PostgreSQL 15 · Docker Compose

---

## Table of Contents

1. [System Purpose and Architecture](#1-system-purpose-and-architecture)
2. [Repository Structure](#2-repository-structure)
3. [How the System Works](#3-how-the-system-works)
4. [Infrastructure Components](#4-infrastructure-components)
5. [Security Model](#5-security-model)
6. [Validation Pipeline](#6-validation-pipeline)
7. [Audit and Forensics](#7-audit-and-forensics)
8. [CI/CD Pipeline](#8-cicd-pipeline)
9. [First Deployment — Step by Step](#9-first-deployment--step-by-step)
10. [Routine Operations](#10-routine-operations)
11. [ICD-11 Version Upgrade](#11-icd-11-version-upgrade)
12. [Scaling](#12-scaling)
13. [Troubleshooting](#13-troubleshooting)
14. [Architecture Decisions You Must Not Reverse](#14-architecture-decisions-you-must-not-reverse)

---

## 1. System Purpose and Architecture

This system is the national FHIR R4 repository for Bangladesh. It serves three purposes simultaneously:

**Repository** — Stores validated FHIR R4 resources submitted by hospitals, clinics, diagnostic labs, and pharmacies (collectively: vendors). No unvalidated resource enters storage.

**Validation engine** — Every incoming resource is validated against BD Core FHIR IG profiles AND against the national ICD-11 terminology authority (OCL) before storage. Invalid resources are rejected with HTTP 422 and a FHIR OperationOutcome describing exactly what failed.

**HIE gateway** — Acts as the national Health Information Exchange boundary. The system enforces that only authenticated, authorised, and clinically valid data enters the national record.

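From a vendor's point of view, these three roles surface as a small set of HTTP outcomes on a create (POST) submission. The sketch below is illustrative only: the function name `classify_fhir_response` and the suggested action labels are assumptions made for this example, while the status codes (201, 401, 422) are the ones described in this document.

```bash
#!/usr/bin/env bash
# Hypothetical vendor-side helper (not part of this repository): maps the HTTP
# status of a POST to the national repository to a suggested next action.
classify_fhir_response() {
  local status="$1"
  case "$status" in
    201) echo "accepted" ;;        # resource validated and stored
    401) echo "reauthenticate" ;;  # token missing, expired, or otherwise invalid
    422) echo "fix-payload" ;;     # validation failed; read the OperationOutcome
    *)   echo "unexpected" ;;      # anything else needs manual investigation
  esac
}

classify_fhir_response 422   # prints: fix-payload
```

A vendor integration would typically log the OperationOutcome body alongside the `fix-payload` result so that the failing element can be corrected before resubmission.
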
### Traffic flow

```
Vendor system
    │
    │  POST /fhir/Condition
    │  Authorization: Bearer {token}
    ▼
Centralised nginx proxy    ← TLS termination, routing (managed separately)
    │
    ▼
HAPI server :8080
    │
    ├─ KeycloakJwtInterceptor          ← validates JWT, extracts facility identity
    ├─ ClusterExpressionValidator      ← validates ICD-11 cluster expressions
    ├─ RequestValidatingInterceptor    ← validates against BD Core IG profiles
    ├─ BdTerminologyValidationSupport  ← validates ICD-11 codes against OCL
    │
    ├─ [ACCEPTED] → HFJ_RESOURCE (postgres-fhir)
    │               AuditEventEmitter → audit.audit_events (postgres-audit)
    │
    └─ [REJECTED] → 422 OperationOutcome to vendor
                    RejectedSubmissionSink → audit.fhir_rejected_submissions (postgres-audit)
                    AuditEventEmitter → audit.audit_events (postgres-audit)
```

### External service dependencies

| Service | URL | Purpose | Failure behaviour |
|---------|-----|---------|-------------------|
| Keycloak | `https://auth.dghs.gov.bd/realms/hris` | JWT validation, JWKS | Fail closed — all requests rejected |
| OCL | `https://tr.ocl.dghs.gov.bd/api/fhir` | ICD-11 terminology validation | Fail open — resource accepted with audit record |
| Cluster validator | `https://icd11.dghs.gov.bd/cluster/validate` | Postcoordinated ICD-11 expressions | Fail open — resource accepted with audit record |

**Fail-open policy for OCL and cluster validator is deliberate.** Service continuity during external service outages takes precedence over perfect validation coverage. Every fail-open event is recorded in the audit log. OCL or cluster validator outages must be treated as high-priority incidents.

---

## 2. Repository Structure

```
bd-fhir-national/
├── .env.example                  ← copy to .env, fill secrets
├── docker-compose.yml            ← production orchestration
├── pom.xml                       ← parent Maven POM, version pins
├── hapi-overlay/
│   ├── Dockerfile                ← multi-stage build
│   ├── pom.xml                   ← runtime dependencies
│   └── src/main/
│       ├── java/bd/gov/dghs/fhir/
│       │   ├── BdFhirApplication.java                   ← Spring Boot entry point
│       │   ├── audit/
│       │   │   ├── AuditEventEmitter.java               ← async INSERT to audit_events
│       │   │   └── RejectedSubmissionSink.java          ← async INSERT to rejected_submissions
│       │   ├── config/
│       │   │   ├── DataSourceConfig.java                ← dual datasource, dual Flyway
│       │   │   ├── FhirServerConfig.java                ← validation chain, IG loading
│       │   │   └── SecurityConfig.java                  ← interceptor registration
│       │   ├── init/
│       │   │   └── IgPackageInitializer.java            ← advisory lock IG loader
│       │   ├── interceptor/
│       │   │   ├── AuditEventInterceptor.java           ← audit hook
│       │   │   └── KeycloakJwtInterceptor.java          ← JWT auth
│       │   ├── terminology/
│       │   │   ├── BdTerminologyValidationSupport.java  ← OCL integration
│       │   │   └── TerminologyCacheManager.java         ← cache flush endpoint
│       │   └── validator/
│       │       └── ClusterExpressionValidator.java      ← cluster expression check
│       └── resources/
│           ├── application.yaml                 ← all Spring/HAPI configuration
│           ├── logback-spring.xml               ← structured JSON logging
│           ├── db/migration/
│           │   ├── fhir/V1__hapi_schema.sql     ← HAPI JPA schema (Flyway)
│           │   └── audit/V2__audit_schema.sql   ← audit schema (Flyway)
│           └── packages/
│               └── .gitkeep                     ← CI places IG .tgz here
├── ops/
│   ├── deployment-guide.md
│   ├── keycloak-setup.md
│   ├── project-manifest.md
│   ├── scaling-roadmap.md
│   └── version-upgrade-integration.md
└── postgres/
    ├── fhir/
    │   ├── init.sql              ← template only — replace with init.sh before deploy
    │   └── postgresql.conf       ← PostgreSQL tuning for HAPI workload
    └── audit/
        ├── init.sql              ← template only — replace with init.sh before deploy
        └── postgresql.conf       ← PostgreSQL tuning for audit workload
```

---

## 3. How the System Works

### Startup sequence

When a HAPI container starts, the following happens in order. If any step fails, the container exits and Docker restarts it.

1. **Flyway — FHIR schema** runs `V1__hapi_schema.sql` against `postgres-fhir` using the superuser credential. Creates all HAPI JPA tables, sequences, and indexes. Skipped if already applied.
2. **Flyway — Audit schema** runs `V2__audit_schema.sql` against `postgres-audit`. Creates partitioned `audit_events` and `fhir_rejected_submissions` tables with monthly partitions pre-created through 2027. Skipped if already applied.
3. **Hibernate validation** checks that the schema exactly matches HAPI's entity mappings (`ddl-auto: validate`). Fails loudly if tables are missing or wrong.
4. **IgPackageInitializer** acquires a PostgreSQL advisory lock on `postgres-fhir`, loads the BD Core IG package from the classpath into HAPI's `NpmPackageValidationSupport`, writes metadata to `NPM_PACKAGE` tables, and releases the lock. The advisory lock prevents race conditions when multiple replicas start simultaneously — only one replica writes the metadata row; subsequent replicas find it already present and skip.
5. **KeycloakJwtInterceptor** fetches the Keycloak JWKS endpoint and caches the signing keys. If Keycloak is unreachable at startup, the interceptor fails to initialise and the container exits.
6. Server begins accepting traffic.

### Request lifecycle — accepted resource

```
1. KeycloakJwtInterceptor
   └─ extracts Bearer token from Authorization header
   └─ verifies signature against cached Keycloak JWKS
   └─ verifies exp, iss = https://auth.dghs.gov.bd/realms/hris
   └─ verifies mci-api role present in realm_access or resource_access
   └─ extracts client_id, sub, sending_facility
   └─ sets request attributes, populates MDC for log correlation

2. AuditEventInterceptor (pre-validation hook)
   └─ invokes ClusterExpressionValidator
      └─ scans Coding elements with system = http://id.who.int/icd/release/11/mms
      └─ if icd11-cluster-expression extension present → calls cluster validator middleware
      └─ if raw postcoordination chars (&, /, %) in code without extension → rejects immediately

3. RequestValidatingInterceptor
   └─ runs FhirInstanceValidator against ValidationSupportChain:
      1. DefaultProfileValidationSupport (base FHIR R4 profiles)
      2. CommonCodeSystemsTerminologyService (UCUM, MimeType, etc.)
      3. SnapshotGeneratingValidationSupport (differential → snapshot)
      4. InMemoryTerminologyServerValidationSupport (cache layer)
      5. NpmPackageValidationSupport (BD Core IG profiles)
      6. BdTerminologyValidationSupport (OCL $validate-code for ICD-11)
   └─ any ERROR severity issue → throws UnprocessableEntityException → 422

4. HAPI JPA persistence
   └─ resource written to HFJ_RESOURCE, HFJ_RES_VER, SPIDX tables

5. AuditEventInterceptor (post-storage hook)
   └─ async: INSERT into audit.audit_events (outcome = ACCEPTED)

6. HTTP 201 Created → vendor
```

### Request lifecycle — rejected resource

```
1-3. Same as above up to validation failure

4. UnprocessableEntityException thrown with FHIR OperationOutcome

5. AuditEventInterceptor (exception hook)
   └─ async: INSERT full payload into audit.fhir_rejected_submissions
   └─ async: INSERT into audit.audit_events (outcome = REJECTED)

6. HTTP 422 Unprocessable Entity → vendor
   Body: OperationOutcome with issue[].diagnostics and issue[].expression
```

### ICD-11 terminology validation detail

`BdTerminologyValidationSupport` intercepts every call to validate an ICD-11 coded element:

1. **Cache check** — if the code was validated in the last 24 hours, serve result from `ConcurrentHashMap`. No OCL call.
2. **Cache miss** — call OCL `$validate-code` with `system=http://id.who.int/icd/release/11/mms`. For `Condition.code`, include `url=https://fhir.dghs.gov.bd/core/ValueSet/bd-condition-icd11-diagnosis-valueset` to enforce the Diagnosis + Finding class restriction.
3. **OCL returns result=true** — cache as valid, return valid to chain.
4. **OCL returns result=false** — cache as invalid, return error to chain → 422.
5. **OCL timeout or 5xx** — log WARN, return null (not supported) — fail open.
6. **`$expand` attempts** — `isValueSetSupported()` returns false for ICD-11 ValueSets. `$expand` is never attempted. This is intentional: OCL does not support `$expand`.

---

## 4. Infrastructure Components

### Docker services

| Service | Image | Purpose | Networks |
|---------|-------|---------|----------|
| `hapi` | Private registry | HAPI FHIR application | frontend, backend-fhir, backend-audit |
| `postgres-fhir` | postgres:15-alpine | FHIR resource store | backend-fhir |
| `postgres-audit` | postgres:15-alpine | Immutable audit store | backend-audit |
| `pgbouncer-fhir` | bitnami/pgbouncer:1.22.1 | Connection pool → postgres-fhir | backend-fhir |
| `pgbouncer-audit` | bitnami/pgbouncer:1.22.1 | Connection pool → postgres-audit | backend-audit |

### Network isolation

`backend-fhir` and `backend-audit` are marked `internal: true` — no external internet access from these networks. The database containers cannot reach external services and external services cannot reach the databases directly.

### pgBouncer configuration

Both pgBouncer instances run in **session mode**. This is mandatory. HAPI uses Hibernate which relies on prepared statements — transaction mode pgBouncer breaks these. Do not change the pool mode.

Pool sizing at pilot phase (1 HAPI replica):

| Pool | HikariCP max per replica | pgBouncer pool_size | PostgreSQL max_connections |
|------|--------------------------|---------------------|----------------------------|
| FHIR | 5 | 20 | 30 |
| Audit | 2 | 10 | 20 |

At 3 replicas: 15 FHIR connections, 6 audit connections — both within pool limits.

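The replica arithmetic above (replicas × HikariCP max compared with the pgBouncer `pool_size`) can be checked before scaling. The helper below is a minimal sketch; the function name `pool_fits` is hypothetical, and the figures passed to it come from the pool sizing table.

```bash
#!/usr/bin/env bash
# Sketch: does replicas x HikariCP max fit within the pgBouncer pool_size?
# pool_fits is a hypothetical helper, not part of this repository.
pool_fits() {
  local replicas="$1" hikari_max="$2" pool_size="$3"
  local needed=$(( replicas * hikari_max ))
  if [ "$needed" -le "$pool_size" ]; then
    echo "ok: ${needed}/${pool_size}"
  else
    echo "over: ${needed}/${pool_size}"
  fi
}

pool_fits 3 5 20   # FHIR pool at 3 replicas  -> ok: 15/20
pool_fits 3 2 10   # audit pool at 3 replicas -> ok: 6/10
pool_fits 5 5 20   # FHIR pool at 5 replicas  -> over: 25/20
```

An `over` result means the pgBouncer pool size and PostgreSQL `max_connections` need to be raised before adding replicas.
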
### Databases

**postgres-fhir** contains all HAPI JPA tables. Schema managed by Flyway `V1__hapi_schema.sql`. `ddl-auto: validate` means Hibernate never modifies the schema — Flyway owns all DDL. If a HAPI upgrade requires schema changes, write a new Flyway migration.

**postgres-audit** contains the audit schema only. Two tables, both partitioned by month. Schema managed by Flyway `V2__audit_schema.sql` against postgres-audit (separate Flyway instance, separate history table `flyway_audit_schema_history`).

### Volumes

| Volume | Contents | Backup priority |
|--------|----------|-----------------|
| `postgres-fhir-data` | All FHIR resources | Critical — primary data |
| `postgres-audit-data` | All audit records, rejected payloads | Critical — forensic/legal |
| `hapi-logs` | Structured JSON application logs | Medium — operational |

---

## 5. Security Model

### Authentication

Every request to FHIR endpoints (except `GET /fhir/metadata` and `/actuator/health/**`) requires a valid Bearer token issued by Keycloak realm `hris`.

`KeycloakJwtInterceptor` performs these checks in order, rejecting with HTTP 401 on any failure:

1. `Authorization: Bearer` header present and non-empty
2. JWT signature valid against Keycloak JWKS (`RS256` only — symmetric algorithms rejected)
3. `exp` claim in the future (not expired)
4. `iss` claim exactly equals `https://auth.dghs.gov.bd/realms/hris`
5. `mci-api` role present in `realm_access.roles` OR in `resource_access.{client-id}.roles`

The JWKS is cached locally with a 1-hour TTL. On receiving a JWT with an unknown `kid`, the JWKS is immediately re-fetched regardless of TTL — this handles Keycloak key rotation without delay.

### Authorisation

**Vendors** — must have `mci-api` role. Client naming convention: `fhir-vendor-{organisation-id}`.

**Admin operations** (cache flush endpoint) — must have `fhir-admin` role. Only the `fhir-admin-pipeline` service account and DGHS system administrators hold this role.

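When a token is rejected, it often helps to look at the claims the checks above operate on (`iss`, `exp`, and the role lists). The helper below is a debugging sketch only: it decodes the payload segment of a JWT and does not verify the signature, so it must never be used for access decisions. The function name `jwt_payload` is hypothetical, and the sketch assumes GNU coreutils (`base64 -d`).

```bash
#!/usr/bin/env bash
# Debugging aid only: prints the claims of a JWT WITHOUT verifying its
# signature. jwt_payload is a hypothetical helper; assumes GNU base64.
jwt_payload() {
  local token="$1" payload
  # The payload is the second dot-separated segment, base64url-encoded.
  payload="$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')"
  # base64url omits '=' padding; restore it so base64 can decode.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```

Running `jwt_payload "$ACCESS_TOKEN"` prints the claims as JSON, which can then be compared against the expected issuer, expiry, and `mci-api` role placement.
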
### Keycloak client setup for new vendors

See `ops/keycloak-setup.md` for the full procedure. Summary:

1. Create client `fhir-vendor-{org-id}` in `hris` realm — confidential, service accounts enabled, standard flow off.
2. Assign `mci-api` role to the service account.
3. Add `sending_facility` user attribute with the DGHS facility code.
4. Add a User Attribute token mapper for `sending_facility` → token claim `sending_facility`.
5. Deliver `client_id` and `client_secret` to the vendor.

If a vendor token is missing the `sending_facility` claim, HAPI logs WARN on every submission and uses `client_id` as the facility identifier in audit records. This is a data quality issue — configure the mapper.

### Vendor token flow

```bash
# Vendor obtains token
curl -s -X POST \
  "https://auth.dghs.gov.bd/realms/hris/protocol/openid-connect/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=fhir-vendor-{org-id}" \
  -d "client_secret={secret}"
# → { "access_token": "eyJ...", "expires_in": 300 }

# Vendor submits resource
curl -s -X POST "https://fhir.dghs.gov.bd/fhir/Condition" \
  -H "Authorization: Bearer eyJ..." \
  -H "Content-Type: application/fhir+json" \
  -d '{ ... }'
```

Tokens expire in 5 minutes (Keycloak default). Vendor systems must obtain a new token before the current one expires.

---

## 6. Validation Pipeline

### BD Core IG profiles

The following resource types are validated against BD Core IG profiles:

| Resource type | Profile URL |
|---------------|-------------|
| Patient | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-patient` |
| Condition | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-condition` |
| Encounter | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-encounter` |
| Observation | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-observation` |
| Practitioner | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-practitioner` |
| Organization | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-organization` |
| Location | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-location` |
| Medication | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-medication` |
| MedicationRequest | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-medicationrequest` |
| Immunization | `https://fhir.dghs.gov.bd/core/StructureDefinition/bd-immunization` |

Resources of any other type are stored with `meta.tag = https://fhir.dghs.gov.bd/tags|unvalidated-profile`. They are not rejected. They can be queried with `_tag=https://fhir.dghs.gov.bd/tags|unvalidated-profile`.

### ICD-11 cluster expression format

BD Core IG defines a specific pattern for postcoordinated ICD-11 expressions. **Raw postcoordinated strings in `Coding.code` are prohibited.**

**Correct format:**

```json
"code": {
  "coding": [{
    "system": "http://id.who.int/icd/release/11/mms",
    "code": "1C62.0",
    "extension": [{
      "url": "icd11-cluster-expression",
      "valueString": "1C62.0/http%3A%2F%2Fid.who.int%2F..."
    }]
  }]
}
```

**Prohibited format (rejected with 422):**

```json
"code": {
  "coding": [{
    "system": "http://id.who.int/icd/release/11/mms",
    "code": "1C62.0&has_severity=mild"
  }]
}
```

### Rejection codes

The `rejection_code` column in `audit.fhir_rejected_submissions` contains one of:

| Code | Meaning |
|------|---------|
| `PROFILE_VIOLATION` | Resource violates a BD Core IG SHALL constraint |
| `TERMINOLOGY_INVALID_CODE` | ICD-11 code not found in OCL |
| `TERMINOLOGY_INVALID_CLASS` | ICD-11 code exists but is not Diagnosis/Finding class |
| `CLUSTER_EXPRESSION_INVALID` | Cluster expression failed cluster validator |
| `CLUSTER_STEM_MISSING_EXTENSION` | Raw postcoordinated string without extension |
| `AUTH_TOKEN_MISSING` | No Bearer token |
| `AUTH_TOKEN_EXPIRED` | Token `exp` in the past |
| `AUTH_TOKEN_INVALID_SIGNATURE` | Signature verification failed |
| `AUTH_TOKEN_MISSING_ROLE` | `mci-api` role absent |
| `AUTH_TOKEN_INVALID_ISSUER` | `iss` does not match Keycloak realm |

---

## 7. Audit and Forensics

### Two audit stores

**`audit.audit_events`** — one row per request outcome. Always written, accepted and rejected. Contains: `event_type`, `operation`, `resource_type`, `resource_id`, `outcome`, `outcome_detail`, `sending_facility`, `client_id`, `subject`, `request_ip`, `request_id`, `validation_messages` (JSONB).

**`audit.fhir_rejected_submissions`** — one row per rejected write. Contains: full resource payload as submitted (TEXT, not JSONB), `rejection_code`, `rejection_reason`, `element_path`, `violated_profile`, `invalid_code`, `invalid_system`.

### Immutability

The `audit_writer_login` PostgreSQL user has INSERT only on the audit schema. The HAPI JVM connects to postgres-audit as this user. No UPDATE or DELETE is possible from the application layer regardless of what the application code attempts. Only a PostgreSQL superuser can modify audit records.

### Partitioning

Both audit tables are partitioned by month (`PARTITION BY RANGE (event_time)`).
Monthly partitions are pre-created through December 2027. A cron job must create next-month partitions on the 20th of each month. If this lapses, INSERTs for a month that has no partition fail with a hard error.

**Set up the cron job immediately after first deployment:**

```bash
# On the host running postgres-audit
crontab -e
# Add the entry below as ONE line. crontab entries cannot be continued
# across lines with a trailing backslash.
0 0 20 * * docker exec bd-postgres-audit psql -U audit_maintainer_login -d auditdb -c "SELECT audit.create_next_month_partitions();" >> /var/log/bd-fhir-partition-maintenance.log 2>&1
```

### Useful audit queries

```sql
-- Rejection rate by vendor, last 7 days
SELECT client_id,
       COUNT(*) AS total,
       SUM(CASE WHEN outcome='REJECTED' THEN 1 ELSE 0 END) AS rejected,
       ROUND(100.0 * SUM(CASE WHEN outcome='REJECTED' THEN 1 ELSE 0 END) / COUNT(*), 1) AS pct
FROM audit.audit_events
WHERE event_time > NOW() - INTERVAL '7 days'
  AND event_type IN ('OPERATION','VALIDATION_FAILURE')
GROUP BY client_id
ORDER BY pct DESC;

-- Retrieve rejected payloads for a vendor
SELECT submission_time, resource_type, rejection_code, rejection_reason, element_path
FROM audit.fhir_rejected_submissions
WHERE client_id = 'fhir-vendor-{org-id}'
ORDER BY submission_time DESC
LIMIT 20;

-- Auth failures
SELECT event_time, client_id, outcome_detail, request_ip
FROM audit.audit_events
WHERE event_type = 'AUTH_FAILURE'
ORDER BY event_time DESC
LIMIT 20;
```

---

## 8. CI/CD Pipeline

The production server **never builds**. It only pulls pre-built images from the private registry.

### CI pipeline steps (on CI machine)

```bash
# 1. Obtain BD Core IG package and place it
cp /path/to/bd.gov.dghs.core-0.2.1.tgz \
  hapi-overlay/src/main/resources/packages/

# 2. Run tests (TestContainers spins up real PostgreSQL — no H2)
mvn test -pl hapi-overlay -am

# 3. Build Docker image (multi-stage: Maven builder + JRE runtime)
docker build \
  --build-arg IG_PACKAGE=bd.gov.dghs.core-0.2.1.tgz \
  --build-arg BUILD_VERSION=1.0.0 \
  --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
  -t your-registry.dghs.gov.bd/bd-fhir-hapi:1.0.0 \
  -f hapi-overlay/Dockerfile \
  .

# 4. Push to private registry
docker push your-registry.dghs.gov.bd/bd-fhir-hapi:1.0.0
```

The `packages/` directory must contain exactly one `.tgz` file matching `HAPI_IG_PACKAGE_CLASSPATH` in `.env`. If the directory is empty or the filename does not match, the container fails startup immediately with a clear error message.

---

## 9. First Deployment — Step by Step

### Prerequisites

- Ubuntu 22.04 LTS, minimum 8GB RAM, 4 vCPU, 100GB disk
- Outbound HTTPS to Keycloak, OCL, cluster validator, private registry
- Docker image already built and pushed (see Section 8)
- Keycloak configured (see `ops/keycloak-setup.md`)

### Step 1 — Install Docker

```bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
# log out and back in
```

### Step 2 — Prepare application directory

```bash
sudo mkdir -p /opt/bd-fhir-national
sudo chown $USER:$USER /opt/bd-fhir-national

# rsync project files from CI/deployment machine (excluding source tree)
rsync -avz --exclude='.git' --exclude='hapi-overlay/target' \
  --exclude='hapi-overlay/src' \
  ./bd-fhir-national/ deploy@server:/opt/bd-fhir-national/
```

### Step 3 — Create .env

```bash
cd /opt/bd-fhir-national
cp .env.example .env
chmod 600 .env
nano .env # fill all values
# verify: grep CHANGE_ME .env should return nothing
```

### Step 4 — Fix init scripts (CRITICAL — do not skip)

The `postgres/fhir/init.sql` and `postgres/audit/init.sql` files are templates with placeholder passwords. PostgreSQL Docker does not perform variable substitution in `.sql` files. Replace them with `.sh` scripts that read from environment variables.

```bash
# FHIR database init script
cat > /opt/bd-fhir-national/postgres/fhir/init.sh <<'EOF'
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    DO \$\$
    BEGIN
        IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = '${FHIR_DB_APP_USER}') THEN
            CREATE USER ${FHIR_DB_APP_USER} WITH
                NOSUPERUSER NOCREATEDB NOCREATEROLE NOINHERIT LOGIN
                CONNECTION LIMIT 30
                PASSWORD '${FHIR_DB_APP_PASSWORD}';
        END IF;
    END
    \$\$;
    GRANT CONNECT ON DATABASE ${POSTGRES_DB} TO ${FHIR_DB_APP_USER};
    GRANT USAGE ON SCHEMA public TO ${FHIR_DB_APP_USER};
    ALTER DEFAULT PRIVILEGES IN SCHEMA public
        GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO ${FHIR_DB_APP_USER};
    ALTER DEFAULT PRIVILEGES IN SCHEMA public
        GRANT USAGE, SELECT ON SEQUENCES TO ${FHIR_DB_APP_USER};
EOSQL
EOF
chmod +x /opt/bd-fhir-national/postgres/fhir/init.sh

# Audit database init script
cat > /opt/bd-fhir-national/postgres/audit/init.sh <<'EOF'
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    DO \$\$
    BEGIN
        IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = '${AUDIT_DB_WRITER_USER}') THEN
            CREATE USER ${AUDIT_DB_WRITER_USER} WITH
                NOSUPERUSER NOCREATEDB NOCREATEROLE NOINHERIT LOGIN
                CONNECTION LIMIT 20
                PASSWORD '${AUDIT_DB_WRITER_PASSWORD}';
        END IF;
        IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = '${AUDIT_DB_MAINTAINER_USER}') THEN
            CREATE USER ${AUDIT_DB_MAINTAINER_USER} WITH
                NOSUPERUSER NOCREATEDB NOCREATEROLE NOINHERIT LOGIN
                CONNECTION LIMIT 5
                PASSWORD '${AUDIT_DB_MAINTAINER_PASSWORD}';
        END IF;
    END
    \$\$;
    GRANT CONNECT ON DATABASE ${POSTGRES_DB} TO ${AUDIT_DB_WRITER_USER};
    GRANT CONNECT ON DATABASE ${POSTGRES_DB} TO ${AUDIT_DB_MAINTAINER_USER};
EOSQL
EOF
chmod +x /opt/bd-fhir-national/postgres/audit/init.sh
```

Update `docker-compose.yml` — in both postgres services, change the init volume mount from `.sql` to `.sh`, and pass the necessary env vars to each service:

```yaml
# postgres-fhir volumes: change
- ./postgres/fhir/init.sh:/docker-entrypoint-initdb.d/init.sh:ro

# add to postgres-fhir environment:
FHIR_DB_APP_USER: ${FHIR_DB_APP_USER}
FHIR_DB_APP_PASSWORD: ${FHIR_DB_APP_PASSWORD}

# postgres-audit volumes: change
- ./postgres/audit/init.sh:/docker-entrypoint-initdb.d/init.sh:ro

# add to postgres-audit environment:
AUDIT_DB_WRITER_USER: ${AUDIT_DB_WRITER_USER}
AUDIT_DB_WRITER_PASSWORD: ${AUDIT_DB_WRITER_PASSWORD}
AUDIT_DB_MAINTAINER_USER: ${AUDIT_DB_MAINTAINER_USER}
AUDIT_DB_MAINTAINER_PASSWORD: ${AUDIT_DB_MAINTAINER_PASSWORD}
```

### Step 5 — Registry login

```bash
docker login your-registry.dghs.gov.bd
docker compose --env-file .env pull
```

### Step 6 — Start databases

```bash
docker compose --env-file .env up -d postgres-fhir postgres-audit

# wait for healthy
until docker compose --env-file .env ps postgres-fhir | grep -q "healthy"; do sleep 3; done
until docker compose --env-file .env ps postgres-audit | grep -q "healthy"; do sleep 3; done
```

### Step 7 — Verify database users

```bash
docker exec bd-postgres-fhir psql -U postgres -d fhirdb \
  -c "SELECT rolname FROM pg_roles WHERE rolname='hapi_app';"
# Expected: hapi_app

docker exec bd-postgres-audit psql -U postgres -d auditdb \
  -c "SELECT rolname FROM pg_roles WHERE rolname IN ('audit_writer_login','audit_maintainer_login');"
# Expected: two rows
```

### Step 8 — Start pgBouncer and HAPI

```bash
docker compose --env-file .env up -d pgbouncer-fhir pgbouncer-audit
docker compose --env-file .env up -d hapi

# Follow startup — takes 60-120 seconds
docker compose --env-file .env logs -f hapi
```

Expected log sequence:

```
Running FHIR Flyway migrations...     → V1 applied
Running Audit Flyway migrations...    → V2 applied
Advisory lock acquired...             → IG loading begins
BD Core IG package loaded...          → IG ready
BdTerminologyValidationSupport initialised...
KeycloakJwtInterceptor initialised...
HAPI RestfulServer interceptors registered...
Tomcat started on port(s): 8080
Started BdFhirApplication in XX seconds
```

### Step 9 — Verify health

```bash
# Internal (direct to HAPI)
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s http://localhost:8080/actuator/health | jq .
# All components must show status: UP

# FHIR metadata
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s http://localhost:8080/fhir/metadata | jq '.software'
# Expected: { "name": "BD FHIR National Repository", "version": "0.2.1" }
```

### Step 10 — Set up partition maintenance cron

```bash
crontab -e
# Add the entry below as ONE line. crontab entries cannot be continued
# across lines with a trailing backslash.
0 0 20 * * docker exec bd-postgres-audit psql -U audit_maintainer_login -d auditdb -c "SELECT audit.create_next_month_partitions();" >> /var/log/bd-fhir-partition-maintenance.log 2>&1
```

### Step 11 — Run acceptance tests

Run all tests from Section 9.3 of `ops/deployment-guide.md`. All nine must pass before the system is declared production-ready.

---

## 10. Routine Operations

### View logs

```bash
# All services
docker compose --env-file .env logs -f

# HAPI logs as structured JSON
docker compose --env-file .env logs -f hapi | jq -R 'try fromjson'

# Filter for rejections
docker compose --env-file .env logs hapi | \
  jq -R 'try fromjson | select(.message | test("rejected|REJECTED"))'
```

### Deploy a new image version

```bash
# Update image tag in .env
nano /opt/bd-fhir-national/.env
# Change HAPI_IMAGE to new tag

# Pull and redeploy
docker compose --env-file .env pull hapi
docker compose --env-file .env up -d --no-deps hapi

# Verify startup
docker compose --env-file .env logs -f hapi
```

### Scale HAPI replicas

```bash
docker compose --env-file .env up -d --scale hapi=3
# No other configuration changes needed at 3 replicas.
# pgBouncer pool_size=20 supports up to 4 replicas at HikariCP max=5.
# At 5+ replicas: increase PGBOUNCER_DEFAULT_POOL_SIZE and postgres max_connections first.
```

### Restart a service

```bash
docker compose --env-file .env restart hapi
docker compose --env-file .env restart postgres-fhir # causes brief HAPI downtime
```

### Full stack restart

```bash
docker compose --env-file .env down
docker compose --env-file .env up -d
```

### Check pgBouncer pool status

```bash
docker exec bd-pgbouncer-fhir psql -h localhost -p 5432 -U pgbouncer pgbouncer \
  -c "SHOW POOLS;"
```

---

## 11. ICD-11 Version Upgrade

When a new ICD-11 MMS release is imported into OCL, the HAPI terminology cache becomes stale. The upgrade pipeline must flush the cache after OCL import.

Full procedure in `ops/version-upgrade-integration.md`. Summary:

**Order is mandatory:**

1. OCL: import new ICD-11 concepts
2. OCL: patch `concept_class` for Diagnosis + Finding
3. OCL: repopulate `bd-condition-icd11-diagnosis-valueset`
4. OCL: verify `$validate-code` returns correct results for new codes
5. HAPI: flush terminology cache
6. HAPI: verify new codes validate correctly

**Step 5 — cache flush:**

```bash
# Get fhir-admin token
ADMIN_TOKEN=$(curl -s -X POST \
  "https://auth.dghs.gov.bd/realms/hris/protocol/openid-connect/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=fhir-admin-pipeline" \
  -d "client_secret=${FHIR_ADMIN_CLIENT_SECRET}" \
  | jq -r '.access_token')

# Flush — run from inside Docker network (admin endpoint is network-restricted)
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s -X DELETE \
  -H "Authorization: Bearer ${ADMIN_TOKEN}" \
  http://localhost:8080/admin/terminology/cache | jq .
# Expected: { "status": "flushed", "entriesEvicted": N }
```

**IG version upgrade** (when BD Core IG advances to a new version):

1. Place new `.tgz` in `src/main/resources/packages/`, remove old one.
2. Update `HAPI_IG_PACKAGE_CLASSPATH` and `HAPI_IG_VERSION` in `.env`.
3. Build and push new Docker image on CI machine.
4. Deploy new image on production server.

---

## 12. Scaling

### Current capacity (Phase 1 — Pilot)

| Metric | Capacity |
|--------|----------|
| HAPI replicas | 1 |
| Vendors | <50 |
| Resources/day | <10,000 |
| PostgreSQL connections (FHIR) | 5 |
| PostgreSQL connections (Audit) | 2 |

### Scaling to Phase 2 (Regional — up to 500 vendors, 100,000 resources/day)

```bash
# Scale HAPI to 3 replicas — no other changes required
docker compose --env-file .env up -d --scale hapi=3
```

Beyond 3 replicas, update pgBouncer pool sizes and PostgreSQL `max_connections` before scaling. See `ops/scaling-roadmap.md` for the full capacity matrix and Phase 3 (national scale → Kubernetes) guidance.

---

## 13. Troubleshooting

### Container not starting

```bash
docker compose --env-file .env logs hapi | tail -50
```

| Log message | Cause | Fix |
|-------------|-------|-----|
| `STARTUP FAILURE: BD Core IG package not found` | `.tgz` missing from image | Rebuild image with package in `packages/` |
| `FHIR Flyway configuration missing` | `SPRING_FLYWAY_*` env vars not set | Check `.env` |
| `password authentication failed for user "hapi_app"` | `init.sh` not run or wrong password | Verify Step 4 of deployment, check `.env` passwords |
| `Advisory lock acquisition timed out` | Another replica holding lock and crashed mid-init | Check `pg_locks` on postgres-fhir, kill stale lock |
| `Connection refused` to Keycloak JWKS | Keycloak unreachable at startup | Check network connectivity, Keycloak health |
| `Schema-validation: missing table` | Flyway did not run | Check `SPRING_FLYWAY_*` env vars, check flyway_schema_history table |

### 401 on all authenticated requests

```bash
# Check JWKS endpoint is reachable from inside the container
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s https://auth.dghs.gov.bd/realms/hris/protocol/openid-connect/certs | jq '.keys | length'
# Expected: 1 or more keys
```

If JWKS is unreachable, all requests will be rejected with 401 (fail closed).
Check firewall rules — the HAPI container must have outbound HTTPS to Keycloak.

### 422 on all ICD-11 coded submissions

```bash
# Check OCL is reachable
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s -o /dev/null -w "%{http_code}" \
  "https://tr.ocl.dghs.gov.bd/api/fhir/metadata"
# Expected: 200

# Check a specific code manually
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s "https://tr.ocl.dghs.gov.bd/api/fhir/ValueSet/\$validate-code?\
url=https://fhir.dghs.gov.bd/core/ValueSet/bd-condition-icd11-diagnosis-valueset\
&system=http://id.who.int/icd/release/11/mms&code=1C62.0" | jq .
```

If OCL is unreachable, the system should be fail-open (codes accepted). If codes are being rejected despite OCL being reachable, check OCL's `$validate-code` response directly.

### Audit writes failing

```bash
# Check HAPI logs for "AUDIT WRITE FAILED"
docker compose --env-file .env logs hapi | grep "AUDIT WRITE FAILED"

# Check audit datasource health
docker exec $(docker compose --env-file .env ps -q hapi | head -1) \
  curl -s http://localhost:8080/actuator/health | jq '.components.auditDb'
```

### Partition missing (INSERT to audit failing)

```bash
# Check which partitions exist
docker exec bd-postgres-audit psql -U postgres -d auditdb -c "
  SELECT c.relname
  FROM pg_class c
  JOIN pg_inherits i ON i.inhrelid = c.oid
  JOIN pg_class p ON p.oid = i.inhparent
  JOIN pg_namespace n ON n.oid = p.relnamespace
  WHERE n.nspname = 'audit' AND p.relname = 'audit_events'
  ORDER BY c.relname DESC
  LIMIT 3;"

# Create missing partition manually
docker exec bd-postgres-audit psql -U postgres -d auditdb \
  -c "SELECT audit.create_next_month_partitions();"
```

### Check disk usage

```bash
docker system df -v
df -h /var/lib/docker
```

---

## 14. Architecture Decisions You Must Not Reverse

These decisions are load-bearing. Reversing any of them without fully understanding the consequences will break the system.

**PostgreSQL only — no H2, not even for tests.** The test suite uses TestContainers to spin up real PostgreSQL 15. H2 is not on the classpath. Using H2 masks database-specific behaviour (advisory locks, partitioning, JSONB) and produces false-green test results.

**Validation on ALL requests — no vendor exemptions.** The `RequestValidatingInterceptor` runs on every write. There is no per-vendor or per-resource-type bypass. This is the HIE boundary enforcement. A bypass for one vendor breaks the national data quality guarantee for everyone downstream.

**OCL is the single terminology authority.** There is no local ICD-11 concept store. All ICD-11 validation goes to OCL. This means OCL availability affects HAPI validation quality. Keep OCL healthy. Do not add a local fallback without understanding the implications for version consistency.

**`$expand` is never attempted for ICD-11 ValueSets.** OCL does not support `$expand`. The `isValueSetSupported()` override returns `false` for all ICD-11 ValueSets. Do not remove this — removing it causes HAPI to attempt `$expand`, receive an empty response, and reject every ICD-11 coded resource regardless of whether the code is valid.

**pgBouncer must remain in session mode.** Hibernate uses prepared statements and advisory locks. Transaction mode pgBouncer breaks both. Do not change `PGBOUNCER_POOL_MODE` to `transaction`.

**Flyway owns all DDL — Hibernate never modifies schema.** `ddl-auto: validate` means Hibernate will refuse to start if the schema does not match its entities, but it will never ALTER or CREATE tables. If a HAPI upgrade changes entity mappings, write a Flyway migration. Never change `ddl-auto` to `update` in production.

**Audit writes are append-only.** The `audit_writer_login` PostgreSQL user has INSERT only. The application cannot UPDATE or DELETE audit records regardless of what the code does. This is enforced at the database level. Do not grant additional privileges to this user.

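
The append-only guarantee can be spot-checked from the database side. The following is a minimal sketch, not part of the shipped tooling: the function name `check_append_only` is hypothetical, and it assumes the privileges are inspected with PostgreSQL's built-in `has_table_privilege()` against `audit.audit_events`, with `psql -At` producing `PRIVILEGE|t` or `PRIVILEGE|f` lines.

```bash
# Reads "PRIVILEGE|t" / "PRIVILEGE|f" lines on stdin (psql -At output).
# Succeeds only if INSERT is granted and nothing else is.
check_append_only() {
  local status=0 priv granted
  while IFS='|' read -r priv granted; do
    case "$priv:$granted" in
      INSERT:t) ;;  # the one privilege the audit writer should hold
      INSERT:f) echo "MISSING privilege: INSERT" >&2; status=1 ;;
      *:t)      echo "UNEXPECTED privilege: $priv" >&2; status=1 ;;
    esac
  done
  return "$status"
}

# Assumed usage (container, database, and user names are from this document):
# docker exec bd-postgres-audit psql -U postgres -d auditdb -At -c "
#   SELECT p, has_table_privilege('audit_writer_login', 'audit.audit_events', p)
#   FROM unnest(ARRAY['INSERT','UPDATE','DELETE','TRUNCATE']) AS p;" \
#   | check_append_only
```

`has_table_privilege()` also reports privileges inherited through role membership, which a direct lookup of grants on the login role alone could miss. A non-zero exit status means the user holds more (or less) than INSERT and the grant should be investigated.
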
**The IG package is bundled in the Docker image.** The `.tgz` is a build-time artifact, not a runtime configuration. There is no hot-reload. An IG upgrade requires a new Docker image build and deployment. This is by design — it ties IG version to container version, making deployments auditable and rollbacks clean.
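
Because the package is baked in at build time, a missing `.tgz` only surfaces later as the `BD Core IG package not found` startup failure. A pre-build guard can catch it earlier. This is a hedged sketch only: the function name `require_ig_package` is hypothetical, and it assumes the package sits directly under `packages/` (the directory named in the troubleshooting table) with a `.tgz` extension.

```bash
# Fails unless the given directory (default: packages) directly contains
# at least one .tgz file. Intended to run before the image build.
require_ig_package() {
  local dir="${1:-packages}" found
  found=$(find "$dir" -maxdepth 1 -type f -name '*.tgz' 2>/dev/null | head -1)
  if [ -z "$found" ]; then
    echo "No IG package (.tgz) found in $dir; refusing to build" >&2
    return 1
  fi
  echo "Bundling IG package: $found"
}

# Assumed usage before building the image:
# require_ig_package packages && docker compose --env-file .env build hapi
```

The guard checks only that a package file is present, not that its version matches the intended IG release.
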