
Monitoring

Service UIs

| Tool | URL (dev) | Purpose |
| --- | --- | --- |
| Temporal UI | localhost:8088 | Workflow execution history, task queue status, running/failed workflows |
| Redpanda Console | localhost:8080 | Topic inspection, consumer group lag, message browsing |
| Meilisearch Dashboard | localhost:7700 | Search index stats, document counts |
| MinIO Console | localhost:9001 | Object storage bucket inspection |
| Strawberry GraphQL IDE | localhost:8000/graphql | Interactive API testing |
| Prometheus | localhost:9090 | Metrics scrape + ad-hoc PromQL |
| Grafana | localhost:3000 | Dashboards (default login admin / admin); the seeded vectis_api.json dashboard renders the 6-panel API view |

Application Logging

Vectis uses Python's standard logging module. All services log to stdout for container-native collection.

Log Levels

| Level | When Used |
| --- | --- |
| INFO | Request handling, order creation, state transitions |
| WARNING | Non-critical issues: stale cache, missing optional config |
| ERROR | Failures: database errors, payment gateway failures, unhandled exceptions |
| DEBUG | Detailed tracing: SQL queries, event payloads (development only) |

Configure with the LOG_LEVEL environment variable (default: INFO).
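
As a minimal sketch of how a service might honor LOG_LEVEL while logging to stdout, the snippet below uses only the standard logging module. The function name configure_logging, the log format string, and the fallback to INFO on an unrecognized value are assumptions for illustration, not the Vectis implementation.

```python
import logging
import os
import sys

# Level names documented in the table above.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(default_level: str = "INFO") -> int:
    """Configure root logging to stdout from LOG_LEVEL; return the numeric level applied."""
    name = os.environ.get("LOG_LEVEL", default_level).upper()
    # Assumption: an unrecognized value falls back to INFO instead of failing startup.
    level = _LEVELS.get(name, logging.INFO)
    logging.basicConfig(
        level=level,
        stream=sys.stdout,  # stdout so the container runtime can collect the logs
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # replace any handlers installed by earlier imports
    )
    return level
```

Calling configure_logging() once at process start, before other modules emit log records, keeps every logger on the same handler and level.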

Temporal Workflows

Monitor long-running business processes in the Temporal UI. Notable workflows:

  • OrderLifecycleWorkflow — tracks order from creation through fulfillment
  • RecurringOrderWorkflow — scheduled subscription order placement
  • CustomerApprovalWorkflow + OrderApprovalWorkflow — B2B account registration and order approval
  • BulkPriceImportWorkflow / ImportEntityWorkflow — bulk data imports
  • RefundExecutionWorkflow (refund_approval module) — per-tender refund execution with idempotent retry
  • VoidExpiringCardAuthsWorkflow (Decided C10) — daily sweep of card auths that aged past their expiry
  • ExpireStaleReservationsWorkflow — TTL expiry on HELD inventory reservations (Decided #160)
  • GcExpiredOverdraftDraftsWorkflow — store-credit overdraft draft GC (Decided #173)
  • GcExpiredRmaDraftsWorkflow, GcExpiredCartTendersWorkflow, CleanupExpiredCartsWorkflow — periodic GC
  • RebuildComplianceCacheWorkflow — nightly compliance cache refresh

Check the Task Queues tab to verify workers are connected and processing tasks.

Warning

If workflows accumulate in "Running" state without progress, check that the Temporal worker is running and connected, either the process started by make worker or the temporal-worker Docker service.

Redpanda Events

Key topics to monitor:

| Topic | Normal Volume | Alert If |
| --- | --- | --- |
| vectis.orders | Proportional to order volume | Consumer lag > 1000 messages |
| vectis.inventory | Proportional to stock adjustments | Consumer lag growing steadily |
| vectis.accounts | Low (account creation/updates) | Any consumer errors |

Use the Redpanda Console to check consumer group lag and browse recent messages for debugging.
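
The two lag conditions in the table can be sketched as plain functions. Consumer lag per partition is the log end offset minus the committed offset; the helper names and the sample offsets below are hypothetical, and fetching the offsets from Redpanda is left out.

```python
from typing import Dict, List


def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: log end offset minus committed offset, never negative."""
    # Assumption: a partition with no committed offset counts from offset 0.
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum of lag across all partitions of a topic for one consumer group."""
    return sum(partition_lag(end_offsets, committed).values())


def lag_is_growing(samples: List[int]) -> bool:
    """True when every lag sample is larger than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


# Made-up offsets for illustration only.
orders_end = {0: 1500, 1: 900}
orders_committed = {0: 400, 1: 900}
orders_alert = total_lag(orders_end, orders_committed) > 1000  # vectis.orders rule
inventory_alert = lag_is_growing([10, 40, 90])                 # vectis.inventory rule
```

A strictly increasing series is only one reading of "growing steadily"; a slope over a longer window would be less sensitive to noise.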

Key Metrics

For production monitoring, expose and track:

| Metric | Source | Threshold |
| --- | --- | --- |
| API response time (p95) | Uvicorn access logs | < 500 ms |
| Database connection pool utilization | SQLAlchemy pool stats | < 80% |
| Redis memory usage | Redis INFO command | < available memory |
| Temporal workflow failure rate | Temporal metrics | < 1% |
| Redpanda consumer lag | Consumer group offset | < 1000 messages |
| Meilisearch index freshness | Last indexed timestamp | < 5 min lag |
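
To make the p95 threshold concrete, here is a small sketch that computes a nearest-rank percentile over a batch of request durations and compares it with the 500 ms target. The sample durations are invented, and how the durations are collected is outside the sketch.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


P95_TARGET_MS = 500  # threshold from the table above

# Invented request durations in milliseconds, for illustration only.
durations_ms = [120, 80, 450, 300, 95, 610, 210, 150, 175, 130]
p95_ms = percentile(durations_ms, 95)
within_target = p95_ms < P95_TARGET_MS
```

With only ten samples the nearest-rank p95 is simply the slowest request, so small batches say little; the figure becomes useful over hundreds of requests per window.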

Alerting Recommendations

  • API 5xx rate > 1% — check application logs for stack traces
  • Database connections exhausted — increase pool size or investigate slow queries
  • Temporal task queue backlog — add worker replicas
  • Redpanda consumer lag increasing — event consumer crashed or overwhelmed
  • Workflow faults emitted — check Analytics → Workflow Faults in the admin; a sudden spike usually means a setting change broke a workflow assumption
  • Schema validation warnings on Redpanda emit — schemas in backend/vectis/events/schemas/ diverged from a producer; fix the producer or update the schema
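
The first rule above (API 5xx rate > 1%) can be approximated from the Prometheus text exposition. The sketch below parses http_requests_total samples and computes the share with a 5xx status. It assumes the counter carries a status label whose value starts with "5" (for example "500" or a grouped "5xx"); check the real label names on /metrics before relying on it. The sample text is made up.

```python
import re

# Greedy label match so route templates such as "/orders/{order_id}" do not end the label set early.
_SAMPLE = re.compile(r'^http_requests_total\{(?P<labels>.*)\}\s+(?P<value>\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def error_ratio(metrics_text: str) -> float:
    """Share of http_requests_total samples whose status label starts with '5'."""
    total = 0.0
    errors = 0.0
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(_LABEL.findall(match.group("labels")))
        value = float(match.group("value"))
        total += value
        if labels.get("status", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


# Made-up exposition text for illustration only.
SAMPLE = """\
# TYPE http_requests_total counter
http_requests_total{handler="/graphql",method="POST",status="2xx"} 970.0
http_requests_total{handler="/graphql",method="POST",status="4xx"} 10.0
http_requests_total{handler="/orders/{order_id}",method="GET",status="5xx"} 20.0
"""
should_alert = error_ratio(SAMPLE) > 0.01
```

Counters are cumulative since process start, so this ratio is a lifetime average. A production alert would normally compare rates over a recent window in Prometheus rather than raw totals.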

Application Metrics (Decided #154)

The API exposes three HTTP endpoints for orchestration and metrics scraping, plus a legacy GraphQL health field:

| Endpoint | Purpose |
| --- | --- |
| GET /metrics | Prometheus-format metrics from prometheus-fastapi-instrumentator: http_requests_total, http_request_duration_seconds, http_requests_inprogress, http_response_size_bytes |
| GET /ready | Readiness probe: checks DB, Redis, and Redpanda in parallel and returns 200 OK only when all three are healthy. Use this for load-balancer health checks and Kubernetes readiness probes |
| GET /health | Cheap liveness check (no dependency probes). Use this for liveness probes that decide when to restart a container |
| { health } GraphQL query | Legacy in-graph health field (returns "Vectis Commerce API is healthy") |

The seeded Grafana dashboard (infra/grafana/dashboards/vectis_api.json) renders six panels: request rate, error rate by status, request latency (p50/p95/p99), in-progress requests, payload size (p95), plus a Totals header.

Token-Bucket Rate Limiting (Decided #157)

The rate limiter (backend/vectis/core/rate_limit.py) is a true token bucket: burst capacity is the ceiling on how many tokens a client can accumulate, not a steady-state floor. Per-endpoint policies live in the rate_limit_policies table (model in backend/vectis/modules/rate_limit/models.py) and reload on change. The limiter attributes calls by X-Forwarded-For (Decided #165), so a BFF that forwards the real client IP gets accurate per-client buckets instead of having every request collapse into a single bucket keyed on the BFF's IP.
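
For readers unfamiliar with the technique, the following is a generic token-bucket sketch, not the code in rate_limit.py. The class and parameter names are hypothetical, and taking the first X-Forwarded-For entry as the client key is an assumption that only holds behind a proxy trusted to set that header.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Generic token bucket: tokens refill continuously and are capped at the burst capacity."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity  # start full so an idle client can burst immediately
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Capacity is the ceiling: idle time never banks more than a full burst.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def client_key(forwarded_for: Optional[str], peer_ip: str) -> str:
    """Pick the bucket key: first X-Forwarded-For entry if present, else the peer address."""
    if forwarded_for:
        first = forwarded_for.split(",")[0].strip()
        if first:
            return first
    return peer_ip
```

Passing the clock in as a parameter keeps the bucket deterministic under test: a fake clock can advance time without sleeping.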

JSON Schema Event Registry (Decided #153)

Producer-side payloads validate against the schema registry in backend/vectis/events/schemas/. Dev and test default to strict mode (invalid payloads raise); production defaults to warn-and-publish (logs a structured warning so a bad payload never blocks a transaction). Override with EVENT_SCHEMA_MODE=strict|warn. The set of registered topics lives in vectis/docs/REDPANDA_TOPICS.md.
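
The strict and warn behaviors can be illustrated with a stdlib-only stand-in. This is not JSON Schema validation: find_violations only checks required fields and their Python types, in place of the real registry. The function names, the logger name, the field names in the usage example, and the "strict" fallback when EVENT_SCHEMA_MODE is unset are all assumptions for the sketch.

```python
import logging
import os
from typing import Any, Dict, List, Optional

logger = logging.getLogger("vectis.events")  # hypothetical logger name


class EventSchemaError(ValueError):
    """Raised in strict mode when a payload does not match its schema."""


def find_violations(payload: Dict[str, Any], required: Dict[str, type]) -> List[str]:
    """Stand-in check: required fields present with the expected Python type."""
    problems = []
    for field, expected in required.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems


def check_before_publish(topic: str, payload: Dict[str, Any], required: Dict[str, type],
                         mode: Optional[str] = None) -> List[str]:
    """Return violations; raise in strict mode, log and continue in warn mode."""
    # Assumption: fall back to strict when neither the argument nor the variable is set.
    mode = (mode or os.environ.get("EVENT_SCHEMA_MODE", "strict")).lower()
    problems = find_violations(payload, required)
    if problems and mode == "strict":
        raise EventSchemaError(f"{topic}: {'; '.join(problems)}")
    if problems:
        # Warn-and-publish: record the problem without blocking the caller.
        logger.warning("event schema violation", extra={"topic": topic, "problems": problems})
    return problems


# Hypothetical schema and payload, for illustration only.
order_schema = {"order_id": str, "total_cents": int}
warned = check_before_publish("vectis.orders", {"order_id": "o-1"}, order_schema, mode="warn")
```

In warn mode the caller still publishes the payload; the returned list only makes the violation visible to logs and tests.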