feat: add gateway data collection service + fix model FK definitions

- Created src/backend/app/services/monitoring/ package:
  - gateway_collector.py: Background asyncio task that polls gateway RPC endpoints
    (usage.cost, usage.status, cron.list, sessions.list/preview, health, status)
    and stores results in monitoring models using upsert pattern
  - models.py: Pydantic schemas for parsing gateway RPC responses
  - __init__.py: Package init, exports GatewayCollectorService

- Added collector startup/shutdown in main.py lifespan:
  - Launches collector as background task when gateways exist
  - Clean shutdown on app termination

- Fixed model FK definitions in monitoring.py and alert_rules.py:
  - Replaced Column(UUID, ForeignKey(...)) with Field(foreign_key=...)
    to match codebase pattern (UUID is Python class, not SQLAlchemy type)
  - Added missing gateway_id field to AlertRule model
  - Removed OpenClawDBService inheritance from GatewayCollectorService
    (uses session factory pattern instead of injected session)
  - Cleaned up duplicate/conflicting imports

- Configurable collection intervals via env vars:
  COLLECTION_INTERVAL_COST (300s), COLLECTION_INTERVAL_CRON (60s),
  COLLECTION_INTERVAL_SESSION (30s), COLLECTION_INTERVAL_HEALTH (60s)
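A minimal stdlib-only sketch of how those knobs behave (`read_intervals` is a hypothetical helper for illustration; the service reads each variable at module import, and its loop ticks at the shortest configured interval):

```python
import os


def read_intervals(env: dict[str, str]) -> dict[str, int]:
    """Read the collection-interval knobs, falling back to the documented defaults (seconds)."""
    defaults = {
        "COLLECTION_INTERVAL_COST": 300,
        "COLLECTION_INTERVAL_CRON": 60,
        "COLLECTION_INTERVAL_SESSION": 30,
        "COLLECTION_INTERVAL_HEALTH": 60,
    }
    return {name: int(env.get(name, str(default))) for name, default in defaults.items()}


intervals = read_intervals(dict(os.environ))
# The collector sleeps for the shortest configured interval between cycles,
# so with the defaults above it wakes every 30 seconds.
tick = min(intervals.values())
```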
Ripley 2026-05-10 20:13:16 -05:00
parent 81794c4a5e
commit d09822a821
7 changed files with 802 additions and 21 deletions

View File

@@ -3,7 +3,7 @@
**This document tracks potential future enhancements for Mission Control.**
**Last Updated:** 2026-05-10
-**Current Version:** v0.0.2
+**Current Version:** v0.0.3
## How to Use This Document
@@ -36,7 +36,7 @@ Items are grouped under their priority section heading (`## 🔴 CRITICAL`, `##
### 🟠 Gateway Data Collection Service — HIGH
**Priority:** HIGH
-**Status:** PENDING
+**Status:** IN PROGRESS
**Added:** 2026-05-10 by Ripley
**Description:**
@@ -51,8 +51,11 @@ Port the dashboard's Go data collection (`refresh.go`, `refresh_sessions.go`, `r
### 🟠 Monitoring Database Models — HIGH
**Priority:** HIGH
-**Status:** PENDING
+**Status:** DONE ✅
**Added:** 2026-05-10 by Ripley
+**Completed:** 2026-05-11
+All 7 models created, migrated, CASCADE + composite indexes verified in running DB. Committed as v0.0.3.
**Description:**
Create new PostgreSQL models for tracking data (cost, sessions, crons, system health, alerts).
@@ -80,6 +83,27 @@ Add FastAPI WebSocket endpoint for real-time agent event broadcasting, porting t
### 🟡 MEDIUM
### 🟠 Dashboard Logic Port — HIGH
**Priority:** HIGH
**Status:** PENDING
**Added:** 2026-05-11 by Ripley
**Description:**
Port the Go dashboard's data-processing logic that we're NOT yet reusing — model name normalization, daily cost charting, alert threshold computation, and token formatting.
**Implementation Notes:**
- The Go source (`sources/dashboard-tracking/internal/apprefresh/`) has significant logic beyond raw collection:
- `ModelName()` — maps raw provider/model strings (e.g. `anthropic/claude-opus-4-6`) to display names ("Claude Opus 4.6")
- `BuildDailyChart()` — aggregates cost/token data into daily buckets for chart rendering
- `BuildAlerts()` — evaluates cost thresholds, cron failures, context usage, and gateway health against configurable rules
- `FmtTokens()` — formats raw token counts (1,234,567 → "1.2M")
- `BuildCostBreakdown()` — organizes per-model cost into ranked lists
- We're already reusing the **gateway RPC layer** (same transport) and **data model shapes** (same fields)
- What we're NOT reusing is the **processing/aggregation logic** — currently the collector just stores raw data
- This must be ported as Python utility functions before building the monitoring frontend, so the API endpoints can serve pre-computed charts and alerts
- Create `src/backend/app/services/monitoring/data_processing.py` for this logic
- Also port `sources/dashboard-tracking/system_types.go` and `system_service.go` for system health data normalization
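As a feel for the scope, the two smallest helpers could port roughly like this (a sketch, not the Go original; the exact thresholds, casing rules, and rounding of `FmtTokens` and `ModelName` are assumptions):

```python
def fmt_tokens(n: int) -> str:
    """Format a raw token count compactly, e.g. 1,234,567 -> "1.2M"."""
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}M"
    if n >= 1_000:
        return f"{n / 1_000:.1f}K"
    return str(n)


def model_display_name(raw: str) -> str:
    """Map a provider/model slug to a display name,
    e.g. "anthropic/claude-opus-4-6" -> "Claude Opus 4.6".
    """
    slug = raw.split("/")[-1]          # drop the provider prefix
    parts = slug.split("-")
    words = [p.capitalize() for p in parts if not p.isdigit()]
    version = ".".join(p for p in parts if p.isdigit())
    return " ".join((words + [version]) if version else words)


print(fmt_tokens(1_234_567))                            # 1.2M
print(model_display_name("anthropic/claude-opus-4-6"))  # Claude Opus 4.6
```

`BuildDailyChart`, `BuildAlerts`, and `BuildCostBreakdown` are larger and depend on the alert-rule schema, so they are the real porting work.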
### 🟡 Cost Tracking UI — MEDIUM
**Priority:** MEDIUM
**Status:** PENDING

View File: src/backend/app/main.py

@@ -2,6 +2,7 @@
from __future__ import annotations
import asyncio
from contextlib import asynccontextmanager
from typing import TYPE_CHECKING, Any
@@ -38,7 +39,10 @@ from app.core.rate_limit import validate_rate_limit_redis
from app.core.rate_limit_backend import RateLimitBackend
from app.core.security_headers import SecurityHeadersMiddleware
from app.db.session import init_db
from app.models.gateways import Gateway
from app.schemas.health import HealthStatusResponse
from app.services.monitoring import GatewayCollectorService
from sqlmodel import select
if TYPE_CHECKING:
from collections.abc import AsyncIterator
@@ -439,6 +443,29 @@ async def lifespan(_: FastAPI) -> AsyncIterator[None]:
settings.db_auto_migrate,
)
await init_db()
# Start the gateway monitoring collector as a background task
# Check if any gateways are registered before starting the collector
from app.db.session import async_session_maker as _session_maker
async with _session_maker() as session:
result = await session.execute(select(Gateway))
gateways = result.scalars().all()
if gateways:
collector = GatewayCollectorService(_session_maker)
collector_task = asyncio.create_task(collector.run())
logger.info(
"app.lifecycle.gateway_collector.start gateways=%d",
len(gateways),
)
# Store the task and collector on the app state for cleanup
_.state.gateway_collector_task = collector_task
_.state.gateway_collector = collector
else:
logger.info("app.lifecycle.gateway_collector.no_gateways")
if settings.rate_limit_backend == RateLimitBackend.REDIS:
validate_rate_limit_redis(settings.rate_limit_redis_url)
logger.info("app.lifecycle.rate_limit backend=redis")
@@ -448,6 +475,17 @@ async def lifespan(_: FastAPI) -> AsyncIterator[None]:
try:
yield
finally:
# Shutdown the gateway collector and cancel its long-running task,
# so shutdown doesn't wait out the remainder of a sleep interval
if hasattr(_.state, "gateway_collector"):
logger.info("app.lifecycle.gateway_collector.shutdown")
try:
await _.state.gateway_collector.shutdown()
_.state.gateway_collector_task.cancel()
try:
await _.state.gateway_collector_task
except asyncio.CancelledError:
pass
except Exception as exc:
logger.error(
"app.lifecycle.gateway_collector.shutdown_error error=%s",
exc,
exc_info=True,
)
logger.info("app.lifecycle.stopped")

View File: src/backend/app/models/alert_rules.py

@@ -5,7 +5,7 @@ from __future__ import annotations
from datetime import datetime
from uuid import UUID, uuid4
-from sqlalchemy import JSON, Column, ForeignKey
+from sqlalchemy import JSON, Column, Index
from sqlmodel import Field
from app.core.time import utcnow
@@ -18,7 +18,8 @@ class AlertRule(QueryModel, table=True):
__tablename__ = "alert_rules" # pyright: ignore[reportAssignmentType]
id: UUID = Field(default_factory=uuid4, primary_key=True)
-organization_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("organizations.id", ondelete="CASCADE"), index=True))
+organization_id: UUID = Field(default=None, foreign_key="organizations.id", index=True)
+gateway_id: UUID | None = Field(default=None, foreign_key="gateways.id", index=True)
name: str = Field(nullable=False)
metric_type: str = Field(nullable=False, index=True)
threshold: float = Field(nullable=False)
@@ -42,15 +43,15 @@ class AlertEvent(QueryModel, table=True):
__tablename__ = "alert_events" # pyright: ignore[reportAssignmentType]
id: UUID = Field(default_factory=uuid4, primary_key=True)
-organization_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("organizations.id", ondelete="CASCADE"), index=True))
+organization_id: UUID = Field(default=None, foreign_key="organizations.id", index=True)
-rule_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("alert_rules.id", ondelete="CASCADE"), index=True))
+rule_id: UUID = Field(default=None, foreign_key="alert_rules.id", index=True)
-gateway_id: UUID | None = Field(default=None, sa_column=Column(UUID, ForeignKey("gateways.id", ondelete="CASCADE"), index=True))
+gateway_id: UUID | None = Field(default=None, foreign_key="gateways.id", index=True)
metric_type: str = Field(nullable=False, index=True)
triggered_value: float = Field(nullable=False)
threshold_value: float = Field(nullable=False)
acknowledged: bool = Field(default=False)
acknowledged_at: datetime | None = Field(default=None)
-acknowledged_by: UUID | None = Field(default=None, sa_column=Column(UUID, ForeignKey("users.id", ondelete="SET NULL"), nullable=True))
+acknowledged_by: UUID | None = Field(default=None, foreign_key="users.id", nullable=True)
resolved_at: datetime | None = Field(default=None)
metadata_: dict | None = Field(default=None, sa_column=Column(JSON))
created_at: datetime = Field(default_factory=utcnow)

View File: src/backend/app/models/monitoring.py

@@ -5,7 +5,7 @@ from __future__ import annotations
from datetime import datetime
from uuid import UUID, uuid4
-from sqlalchemy import JSON, Column, ForeignKey
+from sqlalchemy import JSON, Column, Index
from sqlmodel import Field
from app.core.time import utcnow
@@ -20,8 +20,8 @@ class CostSnapshot(QueryModel, table=True):
__tablename__ = "cost_snapshots" # pyright: ignore[reportAssignmentType]
id: UUID = Field(default_factory=uuid4, primary_key=True)
-organization_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("organizations.id", ondelete="CASCADE"), index=True))
+organization_id: UUID = Field(default=None, foreign_key="organizations.id", index=True)
-gateway_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("gateways.id", ondelete="CASCADE"), index=True))
+gateway_id: UUID = Field(default=None, foreign_key="gateways.id", index=True)
period_start: datetime = Field(nullable=False)
period_end: datetime = Field(nullable=False)
total_cost: float = Field(nullable=False, default=0.0)
@@ -44,8 +44,8 @@ class CronJobStatus(QueryModel, table=True):
__tablename__ = "cron_job_statuses" # pyright: ignore[reportAssignmentType]
id: UUID = Field(default_factory=uuid4, primary_key=True)
-organization_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("organizations.id", ondelete="CASCADE"), index=True))
+organization_id: UUID = Field(default=None, foreign_key="organizations.id", index=True)
-gateway_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("gateways.id", ondelete="CASCADE"), index=True))
+gateway_id: UUID = Field(default=None, foreign_key="gateways.id", index=True)
job_name: str = Field(nullable=False, index=True)
schedule: str = Field(nullable=False)
enabled: bool = Field(default=True)
@@ -70,8 +70,8 @@ class SessionEvent(QueryModel, table=True):
__tablename__ = "session_events" # pyright: ignore[reportAssignmentType]
id: UUID = Field(default_factory=uuid4, primary_key=True)
-organization_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("organizations.id", ondelete="CASCADE"), index=True))
+organization_id: UUID = Field(default=None, foreign_key="organizations.id", index=True)
-gateway_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("gateways.id", ondelete="CASCADE"), index=True))
+gateway_id: UUID = Field(default=None, foreign_key="gateways.id", index=True)
session_key: str = Field(nullable=False, index=True)
event_type: str = Field(nullable=False)
model: str | None = Field(default=None)
@@ -96,10 +96,10 @@ class SubAgentRun(QueryModel, table=True):
__tablename__ = "sub_agent_runs" # pyright: ignore[reportAssignmentType]
id: UUID = Field(default_factory=uuid4, primary_key=True)
-organization_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("organizations.id", ondelete="CASCADE"), index=True))
+organization_id: UUID = Field(default=None, foreign_key="organizations.id", index=True)
-gateway_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("gateways.id", ondelete="CASCADE"), index=True))
+gateway_id: UUID = Field(default=None, foreign_key="gateways.id", index=True)
parent_session_key: str = Field(nullable=False, index=True)
-session_event_id: UUID | None = Field(default=None, sa_column=Column(UUID, ForeignKey("session_events.id", ondelete="CASCADE"), index=True))
+session_event_id: UUID | None = Field(default=None, foreign_key="session_events.id", index=True)
agent: str | None = Field(default=None)
model: str | None = Field(default=None)
status: str = Field(nullable=False, default="pending")
@@ -122,8 +122,8 @@ class SystemHealthMetric(QueryModel, table=True):
__tablename__ = "system_health_metrics" # pyright: ignore[reportAssignmentType]
id: UUID = Field(default_factory=uuid4, primary_key=True)
-organization_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("organizations.id", ondelete="CASCADE"), index=True))
+organization_id: UUID = Field(default=None, foreign_key="organizations.id", index=True)
-gateway_id: UUID = Field(default=None, sa_column=Column(UUID, ForeignKey("gateways.id", ondelete="CASCADE"), index=True))
+gateway_id: UUID = Field(default=None, foreign_key="gateways.id", index=True)
cpu_percent: float | None = Field(default=None)
cpu_cores: int | None = Field(default=None)
ram_used_bytes: int | None = Field(default=None)

View File: src/backend/app/services/monitoring/__init__.py

@@ -0,0 +1,10 @@
"""Background monitoring data collection services.
This package provides the GatewayCollectorService which periodically polls
OpenClaw Gateway RPC endpoints and stores the results in Mission Control's
monitoring models (CostSnapshot, CronJobStatus, SessionEvent, etc.).
"""
from app.services.monitoring.gateway_collector import GatewayCollectorService
__all__ = ("GatewayCollectorService",)

View File: src/backend/app/services/monitoring/gateway_collector.py

@@ -0,0 +1,534 @@
"""Gateway data collection service for Mission Control monitoring.
This module implements GatewayCollectorService, a background asyncio task that
periodically polls OpenClaw Gateway RPC endpoints and stores results in the
monitoring models (CostSnapshot, CronJobStatus, SessionEvent, etc.).
"""
from __future__ import annotations
import asyncio
from typing import Any
from app.core.logging import get_logger
from app.db.session import async_session_maker
from app.models.gateways import Gateway
from app.models.monitoring import (
CostSnapshot,
CronJobStatus,
SessionEvent,
SubAgentRun,
SystemHealthMetric,
)
from app.services.monitoring.models import (
CostResponse,
CronJobStatusResponse,
GatewayHealthResponse,
GatewayStatusResponse,
SessionPreviewResponse,
SessionsListResponse,
UsageStatusResponse,
)
from app.services.openclaw.gateway_resolver import gateway_client_config
from app.services.openclaw.gateway_rpc import (
GatewayConfig as GatewayClientConfig,
OpenClawGatewayError,
openclaw_call,
)
from sqlmodel import select
logger = get_logger(__name__)
import os

# Collection interval environment variables with defaults (seconds)
# These control how frequently each RPC endpoint is polled per gateway
COLLECTION_INTERVAL_COST = int(os.environ.get("COLLECTION_INTERVAL_COST", "300"))
COLLECTION_INTERVAL_CRON = int(os.environ.get("COLLECTION_INTERVAL_CRON", "60"))
COLLECTION_INTERVAL_SESSION = int(os.environ.get("COLLECTION_INTERVAL_SESSION", "30"))
COLLECTION_INTERVAL_HEALTH = int(os.environ.get("COLLECTION_INTERVAL_HEALTH", "60"))
class GatewayCollectorService:
"""Background service that polls gateway RPC endpoints and stores monitoring data.
This service runs as a background asyncio task and periodically polls each
registered gateway for cost, cron, session, and health data. It stores the
results in Mission Control's monitoring models using an upsert pattern
(insert or update) to avoid duplicates.
"""
def __init__(self, session_factory: Any) -> None:
# Use the factory to create sessions
self._session_factory = session_factory
self._shutdown_event = asyncio.Event()
self._tasks: list[asyncio.Task[None]] = []
async def run(self) -> None:
"""Start the collector as a long-running background task."""
logger.info("GatewayCollectorService started")
try:
while not self._shutdown_event.is_set():
await self._collect_all_gateways()
# Wait for the shortest interval before next collection
await asyncio.sleep(min(
COLLECTION_INTERVAL_COST,
COLLECTION_INTERVAL_CRON,
COLLECTION_INTERVAL_SESSION,
COLLECTION_INTERVAL_HEALTH,
))
except asyncio.CancelledError:
logger.info("GatewayCollectorService cancelled")
raise
finally:
logger.info("GatewayCollectorService stopped")
async def shutdown(self) -> None:
"""Signal the collector to stop running."""
self._shutdown_event.set()
# Wait for all gateway polling tasks to complete
if self._tasks:
await asyncio.gather(*self._tasks, return_exceptions=True)
self._tasks.clear()
async def _collect_all_gateways(self) -> None:
"""Fetch all gateways and poll each one concurrently."""
# Use a new session for this collection cycle
async with self._session_factory() as session:
# Get all gateways from the database
result = await session.execute(select(Gateway))
gateways = result.scalars().all()
if not gateways:
logger.debug("No gateways registered, skipping collection cycle")
return
# Create a task group to poll all gateways concurrently
async with asyncio.TaskGroup() as tg:
for gateway in gateways:
self._tasks.append(tg.create_task(
self._poll_gateway(gateway),
name=f"poll_gateway_{gateway.id.hex[:12]}",
))
# The TaskGroup has awaited every task by the time the `async with` exits,
# so drop the finished references to keep the list from growing across cycles
self._tasks.clear()
async def _poll_gateway(self, gateway: Gateway) -> None:
"""Poll a single gateway for all monitoring data."""
config = gateway_client_config(gateway)
logger.debug(
"Polling gateway %s (%s)", gateway.id.hex[:12], gateway.name
)
# Create a task group for independent collection methods
async with asyncio.TaskGroup() as tg:
tg.create_task(self._collect_cost(gateway, config))
tg.create_task(self._collect_cron(gateway, config))
tg.create_task(self._collect_session(gateway, config))
tg.create_task(self._collect_health(gateway, config))
async def _collect_cost(
self, gateway: Gateway, config: GatewayClientConfig
) -> None:
"""Collect cost snapshot data from gateway usage endpoints."""
try:
# Call usage.cost to get cost breakdown by model
cost_result = await openclaw_call(
"usage.cost", {}, config=config
)
cost_data = self._parse_cost_response(cost_result)
# Call usage.status for token usage stats
usage_result = await openclaw_call(
"usage.status", {}, config=config
)
usage_data = self._parse_usage_status(usage_result)
# Build and upsert CostSnapshot
snapshot = CostSnapshot(
organization_id=gateway.organization_id,
gateway_id=gateway.id,
period_start=cost_data.period_start,
period_end=cost_data.period_end,
total_cost=cost_data.total_cost,
model_costs=cost_data.model_costs,
provider_costs=cost_data.provider_costs,
token_counts=usage_data.token_counts,
)
await self._upsert_cost_snapshot(snapshot)
logger.debug(
"Collected cost snapshot for gateway %s: total=$%.2f",
gateway.id.hex[:12],
snapshot.total_cost,
)
except OpenClawGatewayError as exc:
logger.warning(
"Failed to collect cost data from gateway %s: %s",
gateway.id.hex[:12],
exc,
)
except Exception as exc:
logger.error(
"Unexpected error collecting cost data from gateway %s: %s",
gateway.id.hex[:12],
exc,
exc_info=True,
)
async def _collect_cron(
self, gateway: Gateway, config: GatewayClientConfig
) -> None:
"""Collect cron job status data from gateway cron endpoints."""
try:
# Call cron.list to get all cron jobs
cron_list_result = await openclaw_call(
"cron.list", {}, config=config
)
cron_data = self._parse_cron_list_response(cron_list_result)
# Upsert each cron job status
for job_status in cron_data.jobs:
status = CronJobStatus(
organization_id=gateway.organization_id,
gateway_id=gateway.id,
job_name=job_status.job_name,
schedule=job_status.schedule,
enabled=job_status.enabled,
last_run_at=job_status.last_run_at,
next_run_at=job_status.next_run_at,
status=job_status.status,
failure_count=job_status.failure_count,
last_error=job_status.last_error,
metadata_=job_status.metadata_,
)
await self._upsert_cron_job_status(status)
logger.debug(
"Collected %d cron jobs from gateway %s",
len(cron_data.jobs),
gateway.id.hex[:12],
)
except OpenClawGatewayError as exc:
logger.warning(
"Failed to collect cron data from gateway %s: %s",
gateway.id.hex[:12],
exc,
)
except Exception as exc:
logger.error(
"Unexpected error collecting cron data from gateway %s: %s",
gateway.id.hex[:12],
exc,
exc_info=True,
)
async def _collect_session(
self, gateway: Gateway, config: GatewayClientConfig
) -> None:
"""Collect session event data from gateway sessions endpoints."""
try:
# Call sessions.list to get all sessions
sessions_list_result = await openclaw_call(
"sessions.list", {}, config=config
)
sessions_data = self._parse_sessions_list_response(
sessions_list_result
)
# Collect preview data for each session
for session in sessions_data.sessions:
try:
preview_result = await openclaw_call(
"sessions.preview",
{"key": session.session_key},
config=config,
)
preview_data = self._parse_session_preview(
preview_result
)
# Build and upsert SessionEvent
event = SessionEvent(
organization_id=gateway.organization_id,
gateway_id=gateway.id,
session_key=session.session_key,
event_type=preview_data.event_type,
model=preview_data.model,
agent_id=preview_data.agent_id,
channel=preview_data.channel,
context_percent=preview_data.context_percent,
token_counts=preview_data.token_counts,
cost=preview_data.cost,
metadata_=preview_data.metadata_,
)
await self._upsert_session_event(event)
logger.debug(
"Collected session event for %s from gateway %s",
session.session_key,
gateway.id.hex[:12],
)
except OpenClawGatewayError as exc:
logger.warning(
"Failed to get preview for session %s from gateway %s: %s",
session.session_key,
gateway.id.hex[:12],
exc,
)
except OpenClawGatewayError as exc:
logger.warning(
"Failed to collect sessions from gateway %s: %s",
gateway.id.hex[:12],
exc,
)
except Exception as exc:
logger.error(
"Unexpected error collecting sessions from gateway %s: %s",
gateway.id.hex[:12],
exc,
exc_info=True,
)
async def _collect_health(
self, gateway: Gateway, config: GatewayClientConfig
) -> None:
"""Collect system health metrics from gateway health/status endpoints."""
try:
# Call health endpoint
health_result = await openclaw_call(
"health", {}, config=config
)
health_data = self._parse_health_response(health_result)
# Call status endpoint
status_result = await openclaw_call(
"status", {}, config=config
)
status_data = self._parse_status_response(status_result)
# Build and upsert SystemHealthMetric
metric = SystemHealthMetric(
organization_id=gateway.organization_id,
gateway_id=gateway.id,
cpu_percent=health_data.cpu_percent,
cpu_cores=health_data.cpu_cores,
ram_used_bytes=health_data.ram_used_bytes,
ram_total_bytes=health_data.ram_total_bytes,
ram_percent=health_data.ram_percent,
swap_used_bytes=health_data.swap_used_bytes,
swap_total_bytes=health_data.swap_total_bytes,
swap_percent=health_data.swap_percent,
disk_path=health_data.disk_path,
disk_used_bytes=health_data.disk_used_bytes,
disk_total_bytes=health_data.disk_total_bytes,
disk_percent=health_data.disk_percent,
gateway_live=status_data.gateway_live,
gateway_ready=status_data.gateway_ready,
gateway_uptime_ms=status_data.gateway_uptime_ms,
gateway_pid=status_data.gateway_pid,
gateway_version=status_data.gateway_version,
metadata_=health_data.metadata_,
)
await self._upsert_health_metric(metric)
logger.debug(
"Collected health metrics for gateway %s",
gateway.id.hex[:12],
)
except OpenClawGatewayError as exc:
logger.warning(
"Failed to collect health data from gateway %s: %s",
gateway.id.hex[:12],
exc,
)
except Exception as exc:
logger.error(
"Unexpected error collecting health data from gateway %s: %s",
gateway.id.hex[:12],
exc,
exc_info=True,
)
async def _upsert_cost_snapshot(self, snapshot: CostSnapshot) -> None:
"""Upsert a cost snapshot by org+gateway+period."""
async with self._session_factory() as session:
# Find existing snapshot for this period
stmt = select(CostSnapshot).where(
CostSnapshot.organization_id == snapshot.organization_id,
CostSnapshot.gateway_id == snapshot.gateway_id,
CostSnapshot.period_start == snapshot.period_start,
CostSnapshot.period_end == snapshot.period_end,
)
result = await session.execute(stmt)
existing = result.scalar_one_or_none()
if existing:
# Update existing
existing.total_cost = snapshot.total_cost
existing.model_costs = snapshot.model_costs
existing.provider_costs = snapshot.provider_costs
existing.token_counts = snapshot.token_counts
existing.updated_at = snapshot.updated_at
session.add(existing)
else:
# Insert new
session.add(snapshot)
await session.commit()
async def _upsert_cron_job_status(
self, status: CronJobStatus
) -> None:
"""Upsert cron job status by org+gateway+job_name."""
async with self._session_factory() as session:
# Find existing status for this job
stmt = select(CronJobStatus).where(
CronJobStatus.organization_id == status.organization_id,
CronJobStatus.gateway_id == status.gateway_id,
CronJobStatus.job_name == status.job_name,
)
result = await session.execute(stmt)
existing = result.scalar_one_or_none()
if existing:
# Update existing
existing.schedule = status.schedule
existing.enabled = status.enabled
existing.last_run_at = status.last_run_at
existing.next_run_at = status.next_run_at
existing.status = status.status
existing.failure_count = status.failure_count
existing.last_error = status.last_error
existing.metadata_ = status.metadata_
existing.updated_at = status.updated_at
session.add(existing)
else:
# Insert new
session.add(status)
await session.commit()
async def _upsert_session_event(
self, event: SessionEvent
) -> None:
"""Upsert session event by org+gateway+session_key."""
async with self._session_factory() as session:
# Find existing event for this session
stmt = select(SessionEvent).where(
SessionEvent.organization_id == event.organization_id,
SessionEvent.gateway_id == event.gateway_id,
SessionEvent.session_key == event.session_key,
)
result = await session.execute(stmt)
existing = result.scalar_one_or_none()
if existing:
# Update existing
existing.event_type = event.event_type
existing.model = event.model
existing.agent_id = event.agent_id
existing.channel = event.channel
existing.context_percent = event.context_percent
existing.token_counts = event.token_counts
existing.cost = event.cost
existing.metadata_ = event.metadata_
existing.updated_at = event.updated_at
session.add(existing)
else:
# Insert new
session.add(event)
await session.commit()
async def _upsert_health_metric(
self, metric: SystemHealthMetric
) -> None:
"""Upsert health metric by org+gateway+collected_at."""
async with self._session_factory() as session:
# Find existing metric for this collection time
stmt = select(SystemHealthMetric).where(
SystemHealthMetric.organization_id == metric.organization_id,
SystemHealthMetric.gateway_id == metric.gateway_id,
SystemHealthMetric.collected_at == metric.collected_at,
)
result = await session.execute(stmt)
existing = result.scalar_one_or_none()
if existing:
# Update existing
existing.cpu_percent = metric.cpu_percent
existing.cpu_cores = metric.cpu_cores
existing.ram_used_bytes = metric.ram_used_bytes
existing.ram_total_bytes = metric.ram_total_bytes
existing.ram_percent = metric.ram_percent
existing.swap_used_bytes = metric.swap_used_bytes
existing.swap_total_bytes = metric.swap_total_bytes
existing.swap_percent = metric.swap_percent
existing.disk_path = metric.disk_path
existing.disk_used_bytes = metric.disk_used_bytes
existing.disk_total_bytes = metric.disk_total_bytes
existing.disk_percent = metric.disk_percent
existing.gateway_live = metric.gateway_live
existing.gateway_ready = metric.gateway_ready
existing.gateway_uptime_ms = metric.gateway_uptime_ms
existing.gateway_pid = metric.gateway_pid
existing.gateway_version = metric.gateway_version
existing.metadata_ = metric.metadata_
existing.updated_at = metric.updated_at
session.add(existing)
else:
# Insert new
session.add(metric)
await session.commit()
# --- Response Parsers ---
def _parse_cost_response(self, raw: object) -> CostResponse:
"""Parse usage.cost RPC response into typed schema."""
if not isinstance(raw, dict):
raise ValueError("Expected dict for cost response")
return CostResponse.model_validate(raw)
def _parse_usage_status(self, raw: object) -> UsageStatusResponse:
"""Parse usage.status RPC response into typed schema."""
if not isinstance(raw, dict):
raise ValueError("Expected dict for usage status response")
return UsageStatusResponse.model_validate(raw)
def _parse_cron_list_response(
self, raw: object
) -> CronJobStatusResponse:
"""Parse cron.list RPC response into typed schema."""
if not isinstance(raw, dict):
raise ValueError("Expected dict for cron list response")
return CronJobStatusResponse.model_validate(raw)
def _parse_sessions_list_response(
self, raw: object
) -> SessionsListResponse:
"""Parse sessions.list RPC response into typed schema."""
if not isinstance(raw, dict):
raise ValueError("Expected dict for sessions list response")
return SessionsListResponse.model_validate(raw)
def _parse_session_preview(self, raw: object) -> SessionPreviewResponse:
"""Parse sessions.preview RPC response into typed schema."""
if not isinstance(raw, dict):
raise ValueError("Expected dict for session preview response")
return SessionPreviewResponse.model_validate(raw)
def _parse_health_response(self, raw: object) -> GatewayHealthResponse:
"""Parse health RPC response into typed schema."""
if not isinstance(raw, dict):
raise ValueError("Expected dict for health response")
return GatewayHealthResponse.model_validate(raw)
def _parse_status_response(self, raw: object) -> GatewayStatusResponse:
"""Parse status RPC response into typed schema."""
if not isinstance(raw, dict):
raise ValueError("Expected dict for status response")
return GatewayStatusResponse.model_validate(raw)
async def get_collector_service() -> GatewayCollectorService:
"""Create and return a GatewayCollectorService instance."""
return GatewayCollectorService(async_session_maker)
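The select-then-update upsert used by the collector can be reduced to a small generic sketch. The example below is synchronous and uses SQLite for brevity; the `Metric` table and its columns are illustrative placeholders, not Mission Control's actual schema:

```python
from sqlalchemy import create_engine, select, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Metric(Base):
    __tablename__ = "metrics"
    id = Column(Integer, primary_key=True)
    gateway = Column(String, unique=True)
    cpu_percent = Column(Integer)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

def upsert(session: Session, gateway: str, cpu: int) -> None:
    # Look for an existing row first (the "select" half of the pattern)
    existing = session.execute(
        select(Metric).where(Metric.gateway == gateway)
    ).scalar_one_or_none()
    if existing:
        existing.cpu_percent = cpu   # update in place
    else:
        session.add(Metric(gateway=gateway, cpu_percent=cpu))  # insert new
    session.commit()

with Session(engine) as s:
    upsert(s, "gw-1", 40)
    upsert(s, "gw-1", 55)  # second call updates, does not insert
    rows = s.execute(select(Metric)).scalars().all()
    print(len(rows), rows[0].cpu_percent)  # → 1 55
```

For a single background collector task this read-then-write pattern is race-free in practice; under concurrent writers a database-native upsert (e.g. PostgreSQL's `INSERT ... ON CONFLICT`) would be the safer choice.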

View File

@ -0,0 +1,174 @@
"""Pydantic schemas for OpenClaw Gateway RPC response parsing.
These schemas map the raw JSON responses from gateway RPC methods to typed
Python objects. They are used by GatewayCollectorService to parse responses
before storing data in Mission Control's monitoring models.
Note: These are NOT DB models (those are in app/models/monitoring.py).
These are purely for RPC response parsing.
"""
from __future__ import annotations
from datetime import datetime
from typing import Any
from pydantic import BaseModel, Field
class CostPeriod(BaseModel):
"""Cost period breakdown."""
start: datetime
end: datetime
total: float = Field(default=0.0)
model_costs: dict[str, float] | None = Field(default=None, alias="models")
provider_costs: dict[str, float] | None = Field(default=None)
token_counts: dict[str, int] | None = Field(default=None)
    # "model_costs" starts with pydantic's protected "model_" prefix;
    # clear the namespace to silence the warning
    model_config = {"protected_namespaces": ()}
class CostResponse(BaseModel):
"""Response from usage.cost RPC method."""
    period_start: datetime
    period_end: datetime
total_cost: float = Field(alias="total", default=0.0)
model_costs: dict[str, float] | None = Field(alias="models", default=None)
provider_costs: dict[str, float] | None = Field(default=None)
token_counts: dict[str, int] | None = Field(alias="tokens", default=None)
    # "model_costs" starts with pydantic's protected "model_" prefix
    model_config = {"protected_namespaces": ()}
class UsageStatusResponse(BaseModel):
"""Response from usage.status RPC method."""
tokens_used: int = Field(default=0)
tokens_limit: int | None = Field(default=None)
cost_used: float = Field(default=0.0)
cost_limit: float | None = Field(default=None)
model_usage: dict[str, dict[str, Any]] | None = Field(default=None)
    # "model_usage" starts with pydantic's protected "model_" prefix
    model_config = {"protected_namespaces": ()}
class CronJobStatus(BaseModel):
"""Individual cron job status."""
job_name: str = Field(alias="name")
schedule: str
enabled: bool = Field(default=True)
last_run_at: datetime | None = Field(default=None, alias="lastRun")
next_run_at: datetime | None = Field(default=None, alias="nextRun")
status: str = Field(default="idle")
failure_count: int = Field(default=0, alias="failureCount")
last_error: str | None = Field(default=None, alias="lastError")
metadata_: dict[str, Any] | None = Field(default=None, alias="metadata")
model_config = {"populate_by_name": True}
class CronJobStatusResponse(BaseModel):
"""Response from cron.list RPC method."""
jobs: list[CronJobStatus] = Field(default_factory=list)
class SessionPreview(BaseModel):
"""Session preview data."""
session_key: str = Field(alias="key")
event_type: str = Field(default="unknown", alias="eventType")
model: str | None = Field(default=None)
agent_id: str | None = Field(default=None, alias="agentId")
channel: str | None = Field(default=None)
context_percent: float | None = Field(default=None, alias="contextPct")
token_counts: dict[str, int] | None = Field(default=None)
cost: float | None = Field(default=None)
metadata_: dict[str, Any] | None = Field(default=None, alias="metadata")
model_config = {"populate_by_name": True}
class SessionsListResponse(BaseModel):
"""Response from sessions.list RPC method."""
sessions: list[SessionPreview] = Field(default_factory=list)
class SessionPreviewResponse(BaseModel):
"""Response from sessions.preview RPC method."""
session_key: str = Field(alias="key")
event_type: str = Field(default="preview", alias="eventType")
model: str | None = Field(default=None)
agent_id: str | None = Field(default=None, alias="agentId")
channel: str | None = Field(default=None)
context_percent: float | None = Field(default=None, alias="contextPct")
token_counts: dict[str, int] | None = Field(default=None)
cost: float | None = Field(default=None)
metadata_: dict[str, Any] | None = Field(default=None, alias="metadata")
model_config = {"populate_by_name": True}
class GatewayHealthMetrics(BaseModel):
"""System health metrics from gateway health response."""
cpu_percent: float | None = Field(default=None, alias="cpu")
cpu_cores: int | None = Field(default=None, alias="cpuCores")
ram_used_bytes: int | None = Field(default=None, alias="ramUsed")
ram_total_bytes: int | None = Field(default=None, alias="ramTotal")
ram_percent: float | None = Field(default=None, alias="ramPercent")
swap_used_bytes: int | None = Field(default=None, alias="swapUsed")
swap_total_bytes: int | None = Field(default=None, alias="swapTotal")
swap_percent: float | None = Field(default=None, alias="swapPercent")
disk_path: str = Field(default="/", alias="diskPath")
disk_used_bytes: int | None = Field(default=None, alias="diskUsed")
disk_total_bytes: int | None = Field(default=None, alias="diskTotal")
disk_percent: float | None = Field(default=None, alias="diskPercent")
metadata_: dict[str, Any] | None = Field(default=None, alias="metadata")
model_config = {"populate_by_name": True}
class GatewayHealthResponse(BaseModel):
"""Response from health RPC method."""
status: str = Field(default="unknown")
pid: int | None = Field(default=None)
uptime_ms: int | None = Field(default=None, alias="uptimeMs")
memory_bytes: int | None = Field(default=None, alias="memory")
rss_bytes: int | None = Field(default=None, alias="rss")
timestamp: datetime | None = Field(default=None)
metrics: GatewayHealthMetrics | None = Field(default=None)
class GatewayStatus(BaseModel):
"""Gateway runtime status."""
gateway_live: bool = Field(default=False, alias="live")
gateway_ready: bool = Field(default=False, alias="ready")
gateway_uptime_ms: int = Field(default=0, alias="uptimeMs")
gateway_pid: int = Field(default=0, alias="pid")
gateway_version: str = Field(default="unknown", alias="version")
agents_count: int = Field(default=0, alias="agents")
    model_config = {"populate_by_name": True}
class GatewayStatusResponse(BaseModel):
"""Response from status RPC method."""
status: str = Field(default="unknown")
gateway: GatewayStatus
class SubAgentRun(BaseModel):
"""Sub-agent execution record."""
parent_session_key: str = Field(alias="parentSessionKey")
session_event_id: str | None = Field(default=None, alias="sessionId")
agent: str | None = Field(default=None)
model: str | None = Field(default=None)
status: str = Field(default="pending")
duration_ms: int | None = Field(default=None, alias="durationMs")
cost: float | None = Field(default=None)
token_counts: dict[str, int] | None = Field(default=None)
metadata_: dict[str, Any] | None = Field(default=None, alias="metadata")
model_config = {"populate_by_name": True}
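The alias mappings above can be exercised end to end. Below is a minimal sketch using a trimmed copy of `CronJobStatus`; the sample payload is an assumption about the gateway's wire format, inferred from the aliases:

```python
from pydantic import BaseModel, Field

class CronJobStatus(BaseModel):
    job_name: str = Field(alias="name")
    schedule: str
    enabled: bool = True
    failure_count: int = Field(default=0, alias="failureCount")
    model_config = {"populate_by_name": True}

class CronJobStatusResponse(BaseModel):
    jobs: list[CronJobStatus] = Field(default_factory=list)

# Hypothetical cron.list payload: camelCase keys map onto snake_case fields
raw = {"jobs": [{"name": "nightly-report", "schedule": "0 2 * * *", "failureCount": 2}]}
parsed = CronJobStatusResponse.model_validate(raw)
print(parsed.jobs[0].job_name, parsed.jobs[0].failure_count)  # → nightly-report 2
```

`populate_by_name=True` additionally lets callers construct instances with the Python field names (e.g. `CronJobStatus(job_name="x", schedule="* * * * *")`), which is convenient in tests while validation of raw RPC responses still goes through the aliases.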