- health-summary: gateway status, PID, uptime, CPU/RAM/swap/disk, compaction
- cron-summary: cron jobs with schedule, status, failures, model
- sessions-summary: active sessions with model display names, context %, tokens
- sub-agents-summary: sub-agent runs with cost, duration, status, tokens
- trends: cost/token daily trends with 7d/30d range filter
All endpoints are org-scoped, support gateway_id filtering, and use
data_processing functions (ModelName, BuildDailyChart) where appropriate.
Syntax validated with py_compile.
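
For illustration, the daily bucketing behind the trends endpoint could look roughly like the sketch below. It is not the actual BuildDailyChart code: the function name `daily_trend`, the row keys (`day`, `cost`, `tokens`), and the zero-filling of empty days are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_trend(rows, range_days, today):
    """Sum cost and tokens per day over the last `range_days` days, oldest first.

    `rows` is a list of dicts with hypothetical keys: day (date), cost, tokens.
    """
    if range_days not in (7, 30):
        raise ValueError("range_days must be 7 or 30")

    start = today - timedelta(days=range_days - 1)
    buckets = defaultdict(lambda: {"cost": 0.0, "tokens": 0})

    # Accumulate only the rows that fall inside the requested window.
    for row in rows:
        if start <= row["day"] <= today:
            buckets[row["day"]]["cost"] += row["cost"]
            buckets[row["day"]]["tokens"] += row["tokens"]

    # Emit one entry per day, including empty days, so the chart has no gaps.
    out = []
    for offset in range(range_days):
        day = start + timedelta(days=offset)
        bucket = buckets.get(day, {"cost": 0.0, "tokens": 0})
        out.append(
            {
                "date": day.isoformat(),
                "cost": round(bucket["cost"], 2),
                "tokens": bucket["tokens"],
            }
        )
    return out
```

Passing `today` explicitly keeps the function free of clock access, which makes it straightforward to test.
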
- CostSummaryRead schema + GET /monitoring/cost-summary (latest snapshot per gateway)
- CostBreakdownRead schema + GET /monitoring/cost-breakdown (models ranked by cost with %)
- Both endpoints support ?gateway_id= filtering and org-scoping
- Updated FUTURE.md: marked dashboard logic port and WebSocket as done; queued the remaining 5 summary endpoints
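
As a rough sketch of what "models ranked by cost with %" means for the cost-breakdown response, assuming the input is a plain mapping of model name to cost (the function name and output keys here are hypothetical, not the CostBreakdownRead schema):

```python
def cost_breakdown(costs_by_model):
    """Rank models by cost, descending, and attach each model's share of the total."""
    total = sum(costs_by_model.values())
    ranked = sorted(costs_by_model.items(), key=lambda item: item[1], reverse=True)
    return [
        {
            "model": model,
            "cost": round(cost, 2),
            # Guard against division by zero when nothing has been spent yet.
            "percent": round(100 * cost / total, 1) if total else 0.0,
        }
        for model, cost in ranked
    ]
```
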
Dashboard Data Processing (data_processing.py):
- ModelName() — maps raw provider/model strings to display names
- BuildDailyChart() — aggregates cost/token/call data into daily chart buckets
- BuildAlerts() — evaluates cost, cron, context, gateway, memory alert conditions
- BuildCostBreakdown() — ranks models by cost descending
- FmtTokens() — formats token counts (1.2M, 1.5K, etc.)
- round2(), sum_bucket_costs(), TitleCase() — utility functions
- All pure Python, no I/O or RPC; the functions only transform data already read from the DB
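
A minimal sketch of the token formatting described above, assuming one decimal place and thresholds at one thousand and one million. The real FmtTokens() may round or pick units differently; `fmt_tokens_sketch` is a hypothetical name used only for this example.

```python
def fmt_tokens_sketch(count):
    """Format a token count compactly, e.g. 1200000 -> '1.2M', 1500 -> '1.5K'."""
    if count >= 1_000_000:
        return f"{count / 1_000_000:.1f}M"
    if count >= 1_000:
        return f"{count / 1_000:.1f}K"
    # Small counts are shown as-is.
    return str(count)
```
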
WebSocket Agent Events (ws.py + event_parser.py):
- WebSocket endpoint at /ws/agents for real-time agent event broadcasting
- On connect: sends last 50 session events as initial state
- Background task polls SessionEvent table every 2s for new events
- Parses events into dashboard format (agentStatus, agentToolStart, etc.)
- broadcast_event() for other services to push events in real-time
- Lifespan integration: start/stop broadcast task with app startup/shutdown
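
The poll-and-broadcast loop can be sketched as follows. This is an assumption-laden illustration, not the code in ws.py: an in-memory list stands in for the SessionEvent table, `poll_and_broadcast` and `InMemoryEvents` are hypothetical names, and the demo uses a short interval instead of 2s so it finishes quickly.

```python
import asyncio


class InMemoryEvents:
    """Stand-in for the SessionEvent table: append-only rows with increasing ids."""

    def __init__(self):
        self.rows = []

    def add(self, payload):
        self.rows.append({"id": len(self.rows) + 1, "payload": payload})

    def after(self, last_id):
        return [row for row in self.rows if row["id"] > last_id]


async def poll_and_broadcast(store, broadcast, interval=2.0, last_id=0):
    """Every `interval` seconds, broadcast rows newer than the last one seen."""
    while True:
        for row in store.after(last_id):
            await broadcast(row)
            last_id = row["id"]
        await asyncio.sleep(interval)


async def demo():
    store = InMemoryEvents()
    store.add("agentStatus")
    sent = []

    async def broadcast(row):
        sent.append(row["payload"])

    # Start the background task, as the app would do on startup.
    task = asyncio.create_task(poll_and_broadcast(store, broadcast, interval=0.01))
    await asyncio.sleep(0.03)
    store.add("agentToolStart")
    await asyncio.sleep(0.05)

    # Stop the task, as the app would do on shutdown.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sent
```

Tracking the highest id seen so far is what keeps each event from being broadcast twice across polls.
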
Fixed in Ripley review:
- Removed duplicate method definitions in WebSocketConnectionManager
- Fixed broken import in main.py (dangling tuple)
- Removed redundant inline import of re in event_parser.py (already imported at module level)
- Fixed duplicate memory_search/memory_get entries in format_tool_status
- Used asyncio.get_running_loop() instead of asyncio.get_event_loop(), which is deprecated when called without a running loop
- Cleaned up broadcast_to_all so that broken connections are detected and removed
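
One way the broken-connection cleanup in a broadcast method can work is sketched below. It is not the actual WebSocketConnectionManager: `ConnectionManagerSketch` and `FakeSocket` are hypothetical, and the fake only mimics the `send_text` coroutine that a Starlette/FastAPI WebSocket exposes.

```python
import asyncio


class ConnectionManagerSketch:
    def __init__(self):
        self.connections = set()

    async def broadcast_to_all(self, message):
        """Send to every connection; drop the ones whose send fails."""
        broken = []
        # Iterate over a snapshot so the set is never mutated mid-loop.
        for conn in list(self.connections):
            try:
                await conn.send_text(message)
            except Exception:
                broken.append(conn)
        # Remove failed connections only after the loop has finished.
        for conn in broken:
            self.connections.discard(conn)
        return len(self.connections)


class FakeSocket:
    """Test double with the same send_text coroutine shape as a WebSocket."""

    def __init__(self, fail=False):
        self.fail = fail
        self.received = []

    async def send_text(self, message):
        if self.fail:
            raise ConnectionError("peer went away")
        self.received.append(message)
```

Collecting failures first and removing them afterwards means one dead client cannot prevent delivery to the remaining ones.
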
- GET /monitoring/cost-snapshots + /{id} — list/detail with gateway_id, date range filters
- GET /monitoring/cron-jobs + /{id} — list/detail with gateway_id, enabled, job_name filters
- GET /monitoring/sessions + /{id} — list/detail with gateway_id, session_key, model, event_type filters
- GET /monitoring/health + /{id} — list/detail with gateway_id, date range filters
- GET /monitoring/sub-agents + /{id} — list/detail with gateway_id, status, agent filters
- All endpoints org-scoped (require_org_member), paginated (DefaultLimitOffsetPage)
- Pydantic Read schemas for all 5 monitoring models
- Router registered in main.py
- Removed unused imports (OkResponse, utcnow)
- Add monitoring.py: CostSnapshot, CronJobStatus, SessionEvent, SubAgentRun, SystemHealthMetric
- Add alert_rules.py: AlertRule, AlertEvent
- Register all 7 models in __init__.py
- Add Alembic migration 7a8b9c0d1e2f for 7 new monitoring tables
- Add Alembic migration 8f9a0b1c2d3e for CASCADE FK rules, composite indexes, acknowledged_by FK
- Update env.py for transaction-per-migration to avoid failure chaining
- Security: ondelete CASCADE on all org/gateway FKs, SET NULL on acknowledged_by
- Performance: composite indexes on (org_id, created_at) and (org_id, gateway_id) for all monitoring tables
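
The effect of the FK rules can be illustrated with the standard-library sqlite3 module. This is only a sketch: the table and index names below are hypothetical, the schema is far smaller than the real migrations, and the production database may not be SQLite at all.

```python
import sqlite3


def demo():
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE organizations (id INTEGER PRIMARY KEY);
        CREATE TABLE users (id INTEGER PRIMARY KEY);
        CREATE TABLE alert_events (
            id INTEGER PRIMARY KEY,
            org_id INTEGER NOT NULL
                REFERENCES organizations(id) ON DELETE CASCADE,
            acknowledged_by INTEGER
                REFERENCES users(id) ON DELETE SET NULL,
            created_at TEXT NOT NULL
        );
        -- Composite index matching the (org_id, created_at) access pattern.
        CREATE INDEX ix_alert_events_org_created
            ON alert_events (org_id, created_at);

        INSERT INTO organizations (id) VALUES (1), (2);
        INSERT INTO users (id) VALUES (10);
        INSERT INTO alert_events (id, org_id, acknowledged_by, created_at) VALUES
            (100, 1, 10, '2024-01-01'),
            (101, 2, 10, '2024-01-02');
        """
    )

    # Deleting the user nulls acknowledged_by instead of deleting the events.
    conn.execute("DELETE FROM users WHERE id = 10")
    # Deleting an organization removes its events via CASCADE.
    conn.execute("DELETE FROM organizations WHERE id = 1")

    rows = conn.execute(
        "SELECT id, org_id, acknowledged_by FROM alert_events ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

After both deletes, only the event belonging to the surviving organization remains, and its acknowledged_by column is NULL.
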