Monitoring¶
Service UIs¶
| Tool | URL (dev) | Purpose |
|---|---|---|
| Temporal UI | localhost:8088 | Workflow execution history, task queue status, running/failed workflows |
| Redpanda Console | localhost:8080 | Topic inspection, consumer group lag, message browsing |
| Meilisearch Dashboard | localhost:7700 | Search index stats, document counts |
| MinIO Console | localhost:9001 | Object storage bucket inspection |
| Strawberry GraphQL IDE | localhost:8000/graphql | Interactive API testing |
| Prometheus | localhost:9090 | Metrics scrape + ad-hoc PromQL |
| Grafana | localhost:3000 | Dashboards (default login admin / admin); the seeded vectis_api.json dashboard renders the 6-panel API view |
Application Logging¶
Vectis uses Python's standard logging module. All services log to stdout for container-native collection.
Log Levels¶
| Level | When Used |
|---|---|
| INFO | Request handling, order creation, state transitions |
| WARNING | Non-critical issues: stale cache, missing optional config |
| ERROR | Failures: database errors, payment gateway failures, unhandled exceptions |
| DEBUG | Detailed tracing: SQL queries, event payloads (development only) |
Configure with the LOG_LEVEL environment variable (default: INFO).
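As a rough illustration of the behaviour described here, the sketch below reads LOG_LEVEL, defaults to INFO, and sends records to stdout. The function name `configure_logging`, the log format string, and the fallback to INFO on an unrecognized value are assumptions for the example, not taken from the Vectis codebase.

```python
import logging
import os
import sys


def configure_logging(default_level: str = "INFO") -> int:
    """Configure root logging to stdout from LOG_LEVEL; return the effective level."""
    level_name = os.environ.get("LOG_LEVEL", default_level).upper()
    # getLevelName returns an int for known names and a string otherwise.
    level = logging.getLevelName(level_name)
    if not isinstance(level, int):
        level = logging.INFO  # assumed fallback for unrecognized values
    logging.basicConfig(
        stream=sys.stdout,  # container runtimes collect stdout
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # replace handlers installed by earlier calls
    )
    return level
```

Calling `configure_logging()` once at process start, before any module logs, is usually enough; running with `LOG_LEVEL=DEBUG` would then enable the detailed tracing described in the table.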
Temporal Workflows¶
Monitor long-running business processes in the Temporal UI. Notable workflows:
- OrderLifecycleWorkflow — tracks order from creation through fulfillment
- RecurringOrderWorkflow — scheduled subscription order placement
- CustomerApprovalWorkflow + OrderApprovalWorkflow — B2B account registration and order approval
- BulkPriceImportWorkflow / ImportEntityWorkflow — bulk data imports
- RefundExecutionWorkflow (refund_approval module) — per-tender refund execution with idempotent retry
- VoidExpiringCardAuthsWorkflow (Decided C10) — daily sweep of card auths that aged past their expiry
- ExpireStaleReservationsWorkflow — TTL expiry on HELD inventory reservations (Decided #160)
- GcExpiredOverdraftDraftsWorkflow — store-credit overdraft draft GC (Decided #173)
- GcExpiredRmaDraftsWorkflow, GcExpiredCartTendersWorkflow, CleanupExpiredCartsWorkflow — periodic GC
- RebuildComplianceCacheWorkflow — nightly compliance cache refresh
Check the Task Queues tab to verify workers are connected and processing tasks.
Warning
If workflows accumulate in the "Running" state without progress, check that the Temporal worker is running and connected: start it with make worker, or check the temporal-worker Docker service.
Redpanda Events¶
Key topics to monitor:
| Topic | Normal Volume | Alert If |
|---|---|---|
| vectis.orders | Proportional to order volume | Consumer lag > 1000 messages |
| vectis.inventory | Proportional to stock adjustments | Consumer lag growing steadily |
| vectis.accounts | Low (account creation/updates) | Any consumer errors |
Use the Redpanda Console to check consumer group lag and browse recent messages for debugging.
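To make the "Alert If" conditions concrete, the following sketch computes consumer group lag from per-partition offsets and checks the two lag rules in the table. The function names and the "strictly increasing across samples" reading of "growing steadily" are assumptions. The offsets would come from whatever Kafka-compatible client or exporter is in use.

```python
from typing import Mapping, Sequence

LAG_ALERT_THRESHOLD = 1000  # threshold from the vectis.orders row


def total_lag(end_offsets: Mapping[int, int], committed: Mapping[int, int]) -> int:
    """Sum per-partition lag: log end offset minus committed offset."""
    lag = 0
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset is unread from offset 0.
        lag += max(end - committed.get(partition, 0), 0)
    return lag


def is_growing_steadily(samples: Sequence[int]) -> bool:
    """True when each lag sample is strictly larger than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


def orders_lag_alert(end_offsets: Mapping[int, int], committed: Mapping[int, int]) -> bool:
    """Apply the vectis.orders rule: alert when total lag exceeds the threshold."""
    return total_lag(end_offsets, committed) > LAG_ALERT_THRESHOLD
```

A single large lag value can be a harmless burst. The trend check is the more useful signal for vectis.inventory, where volume follows stock adjustments.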
Key Metrics¶
For production monitoring, expose and track:
| Metric | Source | Threshold |
|---|---|---|
| API response time (p95) | Uvicorn access logs | < 500ms |
| Database connection pool utilization | SQLAlchemy pool stats | < 80% |
| Redis memory usage | Redis INFO command | < available memory |
| Temporal workflow failure rate | Temporal metrics | < 1% |
| Redpanda consumer lag | Consumer group offset | < 1000 |
| Meilisearch index freshness | Last indexed timestamp | < 5 min lag |
Alerting Recommendations¶
- API 5xx rate > 1%: check application logs for stack traces
- Database connections exhausted: increase pool size or investigate slow queries
- Temporal task queue backlog: add worker replicas
- Redpanda consumer lag increasing: event consumer crashed or overwhelmed
- Workflow faults emitted: check Analytics → Workflow Faults in the admin; a sudden spike usually means a setting change broke a workflow assumption
- Schema validation warnings on Redpanda emit: schemas in backend/vectis/events/schemas/ diverged from a producer; fix the producer or update the schema
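The 5xx-rate rule in the list above can be evaluated from two successive readings of a request counter. The sketch below assumes the counter is broken down by a status label whose values start with the status class digit (for example "2xx" or "500"). That label layout and the function name are illustrative assumptions.

```python
from typing import Mapping

ERROR_RATE_THRESHOLD = 0.01  # the 1% rule from the list above


def error_rate(prev: Mapping[str, float], curr: Mapping[str, float]) -> float:
    """Fraction of requests between two counter readings that returned 5xx."""
    total = 0.0
    errors = 0.0
    for status, value in curr.items():
        delta = value - prev.get(status, 0.0)
        if delta < 0:
            # Counter went backwards, so the process restarted: count from zero.
            delta = value
        total += delta
        if status.startswith("5"):
            errors += delta
    return errors / total if total else 0.0
```

For example, going from 1000 successes and 10 errors to 1980 successes and 30 errors gives 20 errors out of 1000 requests, a 2% rate that would trip the alert.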
Application Metrics (Decided #154)¶
The API exposes the following endpoints for orchestration and metrics scraping, plus a legacy GraphQL health field:
| Endpoint | Purpose |
|---|---|
| GET /metrics | Prometheus-format metrics from prometheus-fastapi-instrumentator: http_requests_total, http_request_duration_seconds, http_requests_inprogress, http_response_size_bytes |
| GET /ready | Readiness probe: checks DB, Redis, and Redpanda in parallel and returns 200 OK only when all three are healthy. Use this for load-balancer health checks and Kubernetes readiness probes |
| GET /health | Cheap liveness check (no dependency probes). Use this for container liveness checks that trigger restarts |
| { health } GraphQL query | Legacy in-graph health field (returns "Vectis Commerce API is healthy") |
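The parallel dependency check behind a readiness probe such as GET /ready can be sketched with asyncio.gather. This is a minimal illustration, not the Vectis implementation: the `run_readiness` name, the per-check timeout, the report shape, and the 503 status on failure are all assumptions (the table only states that 200 is returned when every dependency is healthy).

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Check = Callable[[], Awaitable[None]]  # a probe succeeds by returning, fails by raising


async def run_readiness(
    checks: Dict[str, Check], timeout: float = 2.0
) -> Tuple[int, Dict[str, str]]:
    """Run all probes concurrently; return (HTTP status, per-dependency result)."""

    async def guarded(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout)
            return "ok"
        except Exception as exc:
            # A failed or timed-out probe marks that dependency unhealthy.
            return f"error: {type(exc).__name__}"

    results = await asyncio.gather(*(guarded(c) for c in checks.values()))
    report = dict(zip(checks.keys(), results))
    status = 200 if all(r == "ok" for r in results) else 503  # 503 is assumed
    return status, report
```

Running the probes concurrently keeps the endpoint's worst-case latency close to the slowest single check rather than the sum of all three.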
The seeded Grafana dashboard (infra/grafana/dashboards/vectis_api.json) renders six panels: request rate, error rate by status, request latency (p50/p95/p99), in-progress requests, payload size (p95), plus a Totals header.
Token-Bucket Rate Limiting (Decided #157)¶
The rate limiter (backend/vectis/core/rate_limit.py) is a proper token-bucket: burst capacity is the ceiling, not a steady-state floor. Per-endpoint policies live in the rate_limit_policies table (model in backend/vectis/modules/rate_limit/models.py) and reload on change. The limiter attributes calls by X-Forwarded-For (Decided #165), so a BFF that forwards the real client IP gets accurate per-client buckets rather than collapsing everything into one BFF-IP bucket.
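For readers unfamiliar with the technique, here is a minimal token-bucket sketch showing why burst capacity acts as a ceiling: tokens refill at a steady rate but never accumulate past `capacity`. The class, the `client_key` helper, and the choice to use the first X-Forwarded-For entry are illustrative assumptions and do not reflect the actual contents of rate_limit.py.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return whether the call is allowed."""
        now = self._clock()
        # Refill for the elapsed time, never exceeding the burst ceiling.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def client_key(forwarded_for: Optional[str], peer_ip: str) -> str:
    """Pick the bucket key: first X-Forwarded-For entry, else the socket peer."""
    if forwarded_for:
        first = forwarded_for.split(",")[0].strip()
        if first:
            return first
    return peer_ip
```

Keying on X-Forwarded-For is only safe when the header is set by a trusted proxy or BFF, since clients can otherwise spoof it to obtain fresh buckets.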
JSON Schema Event Registry (Decided #153)¶
Producer-side payloads validate against the schema registry in backend/vectis/events/schemas/. Dev and test default to strict mode (invalid payloads raise); production defaults to warn-and-publish (logs a structured warning so a bad payload never blocks a transaction). Override with EVENT_SCHEMA_MODE=strict|warn. The set of registered topics lives in vectis/docs/REDPANDA_TOPICS.md.
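The strict versus warn behaviour can be sketched as a small dispatch around a validator. Everything named here is hypothetical: `resolve_mode`, `check_payload`, `EventSchemaError`, the logger name, and the `environment` argument. The validator is passed in as a plain callable standing in for real JSON Schema validation, so the example has no third-party dependencies.

```python
import logging
import os
from typing import Any, Callable, Dict, List

logger = logging.getLogger("vectis.events")  # logger name is hypothetical

Validator = Callable[[Dict[str, Any]], List[str]]  # returns a list of error messages


class EventSchemaError(ValueError):
    """Raised in strict mode when a payload fails validation (hypothetical name)."""


def resolve_mode(environment: str) -> str:
    """EVENT_SCHEMA_MODE wins; otherwise production warns and other environments are strict."""
    override = os.environ.get("EVENT_SCHEMA_MODE", "").lower()
    if override in ("strict", "warn"):
        return override
    return "warn" if environment == "production" else "strict"


def check_payload(topic: str, payload: Dict[str, Any],
                  validate: Validator, mode: str) -> bool:
    """Return True if valid; on errors, raise in strict mode or log in warn mode."""
    errors = validate(payload)
    if not errors:
        return True
    if mode == "strict":
        raise EventSchemaError(f"{topic}: {'; '.join(errors)}")
    # warn mode: log with structured fields and let the caller publish anyway
    logger.warning("event schema validation failed",
                   extra={"topic": topic, "errors": errors})
    return False
```

In warn mode the caller still publishes the event, which matches the stated goal that a bad payload never blocks a transaction.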