Monitoring

Service UIs

Tool	URL (dev)	Purpose
Temporal UI	localhost:8088	Workflow execution history, task queue status, running/failed workflows
Redpanda Console	localhost:8080	Topic inspection, consumer group lag, message browsing
Meilisearch Dashboard	localhost:7700	Search index stats, document counts
MinIO Console	localhost:9001	Object storage bucket inspection
Strawberry GraphQL IDE	localhost:8000/graphql	Interactive API testing
Prometheus	localhost:9090	Metrics scrape + ad-hoc PromQL
Grafana	localhost:3000	Dashboards (default login `admin / admin`); the seeded `vectis_api.json` dashboard renders the 6-panel API view

Application Logging

Vectis uses Python's standard logging module. All services log to stdout for container-native collection.

Log Levels

Level	When Used
`INFO`	Request handling, order creation, state transitions
`WARNING`	Non-critical issues — stale cache, missing optional config
`ERROR`	Failures — database errors, payment gateway failures, unhandled exceptions
`DEBUG`	Detailed tracing — SQL queries, event payloads (development only)

Set the log level with the `LOG_LEVEL` environment variable (default: `INFO`).
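
As an illustration of the behaviour described above, here is a minimal sketch of stdout logging driven by `LOG_LEVEL`. The function name `configure_logging`, the format string, and the fallback to `INFO` on an unrecognised value are assumptions for this example, not the actual Vectis setup.

```python
import logging
import os
import sys


def configure_logging() -> None:
    """Send all log records to stdout at the level named by LOG_LEVEL."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    # getLevelName returns an int for known names and a string otherwise.
    level = logging.getLevelName(level_name)
    if not isinstance(level, int):
        level = logging.INFO  # assumed fallback for unrecognised values
    logging.basicConfig(
        stream=sys.stdout,  # stdout, so the container runtime collects it
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",  # assumed format
        force=True,  # replace any handlers installed earlier
    )
```

Calling `configure_logging()` once at process start, before any module logs, keeps the level consistent across the API and worker processes.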

Temporal Workflows

Monitor long-running business processes in the Temporal UI. Notable workflows:

OrderLifecycleWorkflow — tracks order from creation through fulfillment
RecurringOrderWorkflow — scheduled subscription order placement
CustomerApprovalWorkflow + OrderApprovalWorkflow — B2B account registration and order approval
BulkPriceImportWorkflow / ImportEntityWorkflow — bulk data imports
RefundExecutionWorkflow (refund_approval module) — per-tender refund execution with idempotent retry
VoidExpiringCardAuthsWorkflow (Decided C10) — daily sweep of card auths that aged past their expiry
ExpireStaleReservationsWorkflow — TTL expiry on HELD inventory reservations (Decided #160)
GcExpiredOverdraftDraftsWorkflow — store-credit overdraft draft GC (Decided #173)
GcExpiredRmaDraftsWorkflow, GcExpiredCartTendersWorkflow, CleanupExpiredCartsWorkflow — periodic GC
RebuildComplianceCacheWorkflow — nightly compliance cache refresh

Check the Task Queues tab to verify workers are connected and processing tasks.

Warning

If workflows accumulate in the "Running" state without progress, check that the Temporal worker is running and connected. The worker runs either via `make worker` or as the `temporal-worker` Docker service.

Redpanda Events

Key topics to monitor:

Topic	Normal Volume	Alert If
`vectis.orders`	Proportional to order volume	Consumer lag > 1000 messages
`vectis.inventory`	Proportional to stock adjustments	Consumer lag growing steadily
`vectis.accounts`	Low (account creation/updates)	Any consumer errors

Use the Redpanda Console to check consumer group lag and browse recent messages for debugging.
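
The two alert conditions in the table (lag above 1000 messages, lag growing steadily) can be sketched as follows. This is a rough illustration rather than the project's alerting code: the function names are hypothetical, the offsets are assumed to come from whatever client or exporter you already use, and "growing steadily" is interpreted here as at least three consecutive, strictly increasing samples.

```python
from typing import List, Mapping, Sequence


def total_lag(end_offsets: Mapping[int, int], committed: Mapping[int, int]) -> int:
    """Sum per-partition lag: latest offset minus the group's committed offset."""
    return sum(
        # A partition with no committed offset is assumed to lag by its full length.
        max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def lag_alerts(samples: Sequence[int], threshold: int = 1000) -> List[str]:
    """Evaluate lag samples (oldest first) against the two alert conditions."""
    alerts = []
    if samples and samples[-1] > threshold:
        alerts.append("lag above threshold")
    # "Growing steadily": every sample is larger than the one before it.
    if len(samples) >= 3 and all(b > a for a, b in zip(samples, samples[1:])):
        alerts.append("lag growing steadily")
    return alerts
```

For example, `total_lag({0: 1500, 1: 200}, {0: 400, 1: 200})` gives 1100, which exceeds the `vectis.orders` threshold.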

Key Metrics

For production monitoring, expose and track:

Metric	Source	Threshold
API response time (p95)	Uvicorn access logs	< 500ms
Database connection pool utilization	SQLAlchemy pool stats	< 80%
Redis memory usage	Redis `INFO` command	< available memory
Temporal workflow failure rate	Temporal metrics	< 1%
Redpanda consumer lag	Consumer group offset	< 1000
Meilisearch index freshness	Last indexed timestamp	< 5 min lag

Alerting Recommendations

API 5xx rate > 1% — check application logs for stack traces
Database connections exhausted — increase pool size or investigate slow queries
Temporal task queue backlog — add worker replicas
Redpanda consumer lag increasing — event consumer crashed or overwhelmed
Workflow faults emitted — check Analytics → Workflow Faults in the admin; a sudden spike usually means a setting change broke a workflow assumption
Schema validation warnings on Redpanda emit — schemas in backend/vectis/events/schemas/ diverged from a producer; fix the producer or update the schema

Application Metrics (Decided #154)

The API exposes three HTTP endpoints for orchestration and metrics scraping, plus a legacy GraphQL health field:

Endpoint	Purpose
`GET /metrics`	Prometheus-format metrics from `prometheus-fastapi-instrumentator`: `http_requests_total`, `http_request_duration_seconds`, `http_requests_inprogress`, `http_response_size_bytes`
`GET /ready`	Readiness probe — parallel-checks DB, Redis, and Redpanda; returns `200 OK` only when all three are healthy. Use this for load-balancer health and Kubernetes readiness probes
`GET /health`	Cheap liveness check (no dependency probes). Use for container liveness probes that decide when to restart
`{ health }` GraphQL query	Legacy in-graph health field (returns `"Vectis Commerce API is healthy"`)

The seeded Grafana dashboard (infra/grafana/dashboards/vectis_api.json) renders six panels: request rate, error rate by status, request latency (p50/p95/p99), in-progress requests, payload size (p95), plus a Totals header.
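
The parallel dependency check behind `GET /ready` can be sketched with `asyncio.gather`. This is a minimal illustration, not the actual handler: `run_readiness`, the per-check timeout, and the `503` status on failure are assumptions (the source only states that `200 OK` is returned when all three dependencies are healthy).

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# A check is any zero-argument coroutine function that raises on failure.
Check = Callable[[], Awaitable[None]]


async def run_readiness(
    checks: Dict[str, Check], timeout: float = 2.0
) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check concurrently; 200 only if all succeed."""

    async def probe(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout)
            return "ok"
        except Exception as exc:  # a failed or timed-out check marks it down
            return f"failed: {type(exc).__name__}"

    results = await asyncio.gather(*(probe(c) for c in checks.values()))
    report = dict(zip(checks.keys(), results))
    status = 200 if all(r == "ok" for r in results) else 503  # 503 is assumed
    return status, report
```

Because the checks run concurrently, the probe takes roughly as long as the slowest dependency rather than the sum of all three, which matters when a load balancer polls it every few seconds.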

Token-Bucket Rate Limiting (Decided #157)

The rate limiter (backend/vectis/core/rate_limit.py) is a proper token-bucket: burst capacity is the ceiling, not a steady-state floor. Per-endpoint policies live in the rate_limit_policies table (model in backend/vectis/modules/rate_limit/models.py) and reload on change. The limiter attributes calls by X-Forwarded-For (Decided #165), so a BFF that forwards the real client IP gets accurate per-client buckets rather than collapsing everything into one BFF-IP bucket.
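
To make the token-bucket behaviour concrete, here is a generic sketch of the technique, not the contents of `rate_limit.py`. The class and function names, the injectable clock, the lower-cased header mapping, and the choice of the leftmost `X-Forwarded-For` entry as the client address are all assumptions for illustration; trusting that header is only safe when the request arrives through a proxy you control.

```python
import time
from typing import Callable, Mapping


class TokenBucket:
    """Generic token bucket: capacity caps bursts, refill_rate sets the sustained rate."""

    def __init__(
        self,
        capacity: float,
        refill_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity        # maximum burst size (the ceiling)
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = capacity          # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def client_key(headers: Mapping[str, str], peer_ip: str) -> str:
    """Bucket key: leftmost X-Forwarded-For entry, else the socket peer address."""
    forwarded = headers.get("x-forwarded-for", "")  # assumes lower-cased keys
    first = forwarded.split(",")[0].strip()
    return first or peer_ip
```

An idle client never accumulates more than `capacity` tokens, so a long quiet period buys one burst of that size and nothing more. Keeping one bucket per `client_key` value is what prevents all traffic behind a BFF from sharing a single bucket.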

JSON Schema Event Registry (Decided #153)

Producer-side payloads validate against the schema registry in backend/vectis/events/schemas/. Dev and test default to strict mode (invalid payloads raise); production defaults to warn-and-publish (logs a structured warning so a bad payload never blocks a transaction). Override with EVENT_SCHEMA_MODE=strict|warn. The set of registered topics lives in vectis/docs/REDPANDA_TOPICS.md.
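
The strict versus warn-and-publish behaviour can be sketched as below. This is an illustration under stated assumptions, not the registry code: `resolve_mode`, `emit`, the `app_env` argument, the logger name, and the `validate` callable (standing in for real JSON Schema validation and returning a list of error strings) are all hypothetical.

```python
import logging
import os
from typing import Any, Callable, Dict, List

logger = logging.getLogger("vectis.events")  # hypothetical logger name


def resolve_mode(app_env: str) -> str:
    """EVENT_SCHEMA_MODE wins; otherwise strict everywhere except production."""
    override = os.environ.get("EVENT_SCHEMA_MODE", "").lower()
    if override in ("strict", "warn"):
        return override
    return "warn" if app_env == "production" else "strict"


def emit(
    topic: str,
    payload: Dict[str, Any],
    validate: Callable[[Dict[str, Any]], List[str]],  # stand-in for schema validation
    publish: Callable[[str, Dict[str, Any]], None],
    app_env: str,
) -> None:
    """Validate on the producer side, then publish unless strict mode rejects."""
    errors = validate(payload)
    if errors:
        if resolve_mode(app_env) == "strict":
            raise ValueError(f"{topic}: invalid payload: {errors}")
        # Warn mode: log structured context and publish anyway.
        logger.warning(
            "schema validation failed",
            extra={"topic": topic, "errors": errors},
        )
    publish(topic, payload)
```

The trade-off is that warn mode favours availability (a bad payload never blocks a transaction), so the structured warning is the only signal that a producer and its schema have diverged.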