Non-Functional Delta: 007-observability-lgtm-compose
Parent state: 006-messaging-nats-replacement
This document records the non-functional requirement (NFR) changes introduced by this state.
Runtime / Operations
- Add observability services to compose runtime:
  - Grafana (:3001, container 3000)
  - Prometheus (:9090)
  - Loki (:3100)
  - Tempo (:3200)
  - OTel Collector (:4317, :4318, :13133)
  - Blackbox Exporter (:9115)
  - Promtail (internal)
- Keep all existing TraderX service ports unchanged from state 006.
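The port list above lends itself to a quick reachability check after the compose stack starts. The following is a minimal sketch, assuming the services are published on localhost at the host ports listed. The dictionary keys and the names `OBSERVABILITY_PORTS`, `tcp_probe`, and `unreachable_ports` are illustrative and not part of the TraderX tooling.

```python
import socket
from typing import Callable, Dict, List, Tuple

# Host ports taken from the list above. The service keys are illustrative
# labels, not necessarily the compose service names. Promtail is internal
# and publishes no host port, so it is omitted. Grafana's host port 3001
# maps to container port 3000.
OBSERVABILITY_PORTS: Dict[str, Tuple[int, ...]] = {
    "grafana": (3001,),
    "prometheus": (9090,),
    "loki": (3100,),
    "tempo": (3200,),
    "otel-collector": (4317, 4318, 13133),
    "blackbox-exporter": (9115,),
}


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_ports(
    host: str = "localhost",
    probe: Callable[[str, int], bool] = tcp_probe,
) -> List[Tuple[str, int]]:
    """Return (service, port) pairs that the probe could not reach."""
    return [
        (service, port)
        for service, ports in OBSERVABILITY_PORTS.items()
        for port in ports
        if not probe(host, port)
    ]
```

Calling `unreachable_ports()` with the stack running should return an empty list. The `probe` parameter is injectable so the check can be exercised without a live environment.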
Security / Compliance
- No authentication hardening added in this state; Grafana uses local development credentials by default.
- State is intended for local learning environments, not production deployment.
- As convergence level C1, this state requires container build/publish CI with namespace ghcr.io/finos/traderx-c1/<component>.
- Generated artifacts must include a GHCR run bundle so users can run the C1 environment from published images.
Performance / Scalability
- Prometheus probe interval defaults to 15 seconds to balance signal quality and local resource cost.
- Log scraping uses Docker service discovery and label relabeling for low-friction local operation.
Reliability / Observability
- Blackbox probe success and latency metrics are available for key TraderX endpoints.
- Spring Boot actuator Prometheus metrics are scraped for all compatible JVM services in this state.
- Prometheus-compatible metrics exposure is a required integration point: if a service supports it, scrape targets and dashboards must be updated in the same change.
- Container logs are queryable in Grafana via Loki.
- Smoke validation must verify Loki-backed dashboard data is non-empty for both total runtime streams and service-filtered panels (for example messaging, pricing pipeline, and control-plane views).
- OTel Collector and Tempo are wired for trace ingestion to support future instrumentation growth.
- Provisioned dashboards provide out-of-the-box visibility for service availability, latency, log throughput, and JVM/service metric health.
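The Loki smoke-validation requirement above could be checked along these lines. This is a minimal sketch, assuming Loki is reachable on the host port :3100 and queried through its HTTP `query_range` endpoint. The function names, the 15-minute lookback, and the `container` label in the example selectors are assumptions; the actual labels depend on how Promtail relabels the Docker targets in this state.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, List

LOKI_URL = "http://localhost:3100"  # assumed: Loki host port from the runtime list


def fetch_query_range(query: str, lookback_s: int = 900) -> Dict[str, Any]:
    """Run a LogQL query against Loki over the last `lookback_s` seconds."""
    end_ns = time.time_ns()
    start_ns = end_ns - lookback_s * 1_000_000_000
    params = urllib.parse.urlencode(
        {"query": query, "start": start_ns, "end": end_ns, "limit": 10}
    )
    url = f"{LOKI_URL}/loki/api/v1/query_range?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def has_data(response: Dict[str, Any]) -> bool:
    """True if a successful response holds at least one log line or sample."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    return any(len(series.get("values", [])) > 0 for series in results)


def empty_queries(
    queries: Dict[str, str],
    fetch: Callable[[str], Dict[str, Any]] = fetch_query_range,
) -> List[str]:
    """Return the names of the queries that came back with no data."""
    return [name for name, query in queries.items() if not has_data(fetch(query))]


# Hypothetical selectors: one for total runtime streams, one service-filtered.
EXAMPLE_QUERIES = {
    "total": '{container=~".+"}',
    "messaging": '{container=~".*nats.*"}',
}
```

A smoke script would call `empty_queries(EXAMPLE_QUERIES)` and fail if the returned list is non-empty. Taking the fetcher as a parameter keeps the emptiness logic testable against canned responses.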