Mission-Control/PROJECT.md

# 🎯 Mission Control — Project Plan

> Merge three OpenClaw dashboards into a single, unified Mission Control platform.

---

## Source Repos

| Repo | Purpose | Stack | Key Assets |
|---|---|---|---|
| [abhi1693/openclaw-mission-control](https://github.com/abhi1693/openclaw-mission-control) | **Base platform** — work orchestration, governance, gateway management | Python/FastAPI + PostgreSQL + Redis + Next.js (React 19) + Clerk auth + Docker Compose | Organizations, boards, tasks, tags, approvals, agents, gateways, webhooks, activity feed, skills marketplace |
| [mudrii/openclaw-dashboard](https://github.com/mudrii/openclaw-dashboard) | **Tracking layer** — real-time metrics, costs, crons, sessions, system health | Go binary (zero deps) + embedded HTML/JS + SVG charts | Cost cards, cron status, session tracking, sub-agent activity, AI chat, system metrics (CPU/RAM/disk), 6 themes, alerts, token usage |
| [jaffer1979/openclaw-pixel-agents-dashboard](https://github.com/jaffer1979/openclaw-pixel-agents-dashboard) | **Agent visualization** — pixel-art agent sprites, real-time activity | Node/Express + Vite + React 19 + Canvas/WebSocket + JSONL parsing | Agent sprites with activity bubbles, conversation heat, spawn sub-agents, hardware monitor, service controls, day/night cycle |

---

## Architecture Decision: What to Merge Into What

**Base: openclaw-mission-control** — this becomes the foundation because:
- It has the richest data model (organizations, boards, tasks, approvals, agents, gateways, webhooks)
- It has proper auth (Clerk or local bearer token)
- It has a full API layer (FastAPI with SQLModel/SQLAlchemy)
- It has multi-tenancy built in
- It has the most mature frontend (Next.js 16 + React 19 + TanStack Query + Recharts)

**Merge FROM dashboard** — extract the tracking/monitoring features:
- Cost tracking, token usage, model breakdown
- Cron job status, scheduling, last/next run
- Session tracking, sub-agent activity
- System health (CPU, RAM, disk, gateway status)
- AI chat panel (ask questions about your data)
- Alert system (high cost, failed crons, context usage)
- 6 themes + glass morphism UI

**Merge FROM pixel-agents** — extract the agent visualization:
- Pixel-art agent sprites in a shared office scene
- Real-time activity bubbles, conversation heat
- Sub-agent spawning from the UI
- Hardware monitor (CPU/GPU/RAM/disk/network)
- Service controls (start/stop/restart gateway)
- Day/night cycle ambient lighting

---

## Technical Analysis

### Base Platform (openclaw-mission-control)

**Backend:**
- Python 3.12+, FastAPI, SQLModel/SQLAlchemy, PostgreSQL, Redis
- Alembic migrations, RQ worker for webhooks
- Full OpenClaw gateway integration via WebSocket RPC (device pairing, control UI)
- Gateway methods: 60+ RPC calls for sessions, agents, cron, config, exec approvals, etc.
- Auth: Clerk JWT or local bearer token (≥50 chars)

**Frontend:**
- Next.js 16.1.7, React 19.2, TanStack Query v5, TanStack Table v8
- Radix UI primitives, Tailwind CSS, Recharts, React Markdown
- 40+ page routes (dashboard, boards, agents, approvals, gateways, skills, tags, etc.)
- Cypress E2E tests

**Data Model (27 tables):**
- Organizations, users, boards, board_groups, tasks, tags, approvals
- Agents, gateways, activity_events, board_webhooks, skills
- Custom fields, task dependencies, task fingerprints
- Board memory, board group memory, onboarding

**What it LACKS that the others have:**
- No real-time cost/token tracking
- No system health monitoring (CPU/RAM/disk)
- No cron job visualization
- No session/sub-agent activity monitoring
- No AI chat for asking about your deployment
- No pixel-art agent visualization
- No hardware monitoring
- No service controls (start/stop/restart gateway)

### Dashboard (openclaw-dashboard) — What We Pull

**Data Collection (Go):**
- `refresh.go` — main collector, reads OpenClaw filesystem + gateway API
- `refresh_sessions.go` — session listing, model resolution
- `refresh_tokens.go` — token usage tracking
- `cron_state` — cron job parsing and status
- `system.go` — CPU, RAM, swap, disk, gateway runtime probes

**API Endpoints:**
- `/api/refresh` — stale-while-revalidate data.json
- `/api/chat` — AI chat via OpenClaw gateway
- `/api/system` — live host metrics
- `/api/logs` — merged log tail
- `/api/errors` — aggregated error feed

**Frontend:**
- Pure HTML/CSS/JS (single `index.html`) — we'll rewrite as React components
- State management: 7 plain objects (State, DataLayer, DirtyChecker, Renderer, Theme, Chat, App)
- SVG chart rendering (cost trends, model breakdown, sub-agent activity)
- 6 themes with 19 CSS color variables each

**Integration Approach:**
- Port the Go data collection to Python services that hit the OpenClaw gateway API
- Replace the embedded HTML frontend with React components in the Next.js app
- Use the existing gateway RPC connection in Mission Control's backend
- Add PostgreSQL models for tracking data (cost snapshots, cron states, session events)

### Pixel Agents (openclaw-pixel-agents-dashboard) — What We Pull

**Backend (Node/Express):**
- `sessionWatcher.ts` — tails JSONL session files, parses events
- `spawner.ts` — spawns sub-agents via gateway API
- `services.ts` — gateway service controls (start/stop/restart)
- `hardware.ts` — hardware stats collection
- `openclawParser.ts` — JSONL event parsing
- WebSocket broadcasting to frontend

**Frontend (React/Vite):**
- Pixel-art canvas renderer (`OfficeCanvas.tsx`, game loop, character sprites)
- Activity bubbles, conversation heat overlays
- Spawn chat panel, session info panel
- Server rack (hardware monitor), breaker panel (service controls)
- Ham radio (update checker), fire alarm (gateway restart)

**Integration Approach:**
- Port JSONL session watcher to Python (watch OpenClaw session directory)
- Move sub-agent spawning to use Mission Control's existing gateway RPC
- Rebuild the pixel-art canvas as a React component within Next.js
- Add WebSocket support to FastAPI for real-time agent events
- Hardware stats collected via the gateway's `health` and `status` methods

---

## Implementation Plan

### Phase 1: Foundation Setup (Week 1)

**1.1 — Fork and Stand Up Base**
- Fork `abhi1693/openclaw-mission-control` to our org
- Stand up local dev environment (Docker Compose: Postgres + Redis + backend + frontend)
- Verify all existing features work: auth, boards, tasks, agents, gateways, approvals
- Document the data model and API surface

**1.2 — Add Tracking Models (Backend)**
- Create new PostgreSQL models:
  - `CostSnapshot` — daily cost tracking per model/gateway
  - `CronJobStatus` — cron schedule, last/next run, duration, status
  - `SessionEvent` — session start/stop, model, tokens, context %
  - `SubAgentRun` — sub-agent spawn, cost, duration, status
  - `SystemHealthMetric` — CPU, RAM, disk, swap, gateway uptime
  - `AlertRule` — configurable alert thresholds
- Create Alembic migration
- Add CRUD API endpoints under `/api/monitoring/`

**1.3 — Gateway Data Collection Service**
- Create `app/services/monitoring/gateway_collector.py`
- Reuse existing `gateway_rpc.py` to poll:
  - `usage.cost` — cost data
  - `usage.status` — token counts
  - `cron.list` / `cron.status` — cron jobs
  - `sessions.list` / `sessions.preview` — sessions
  - `agents.list` — agents
  - `health` — gateway health
  - `status` — gateway runtime status
- Run as background task (asyncio) with configurable intervals
- Store collected data in the new models

### Phase 2: Tracking Dashboard (Week 2)

**2.1 — Monitoring Pages (Frontend)**
- New Next.js routes:
  - `/monitoring` — main dashboard (cost cards, system health, alerts)
  - `/monitoring/costs` — detailed cost breakdown with charts
  - `/monitoring/sessions` — active sessions, sub-agent activity
  - `/monitoring/crons` — cron job management
  - `/monitoring/system` — CPU/RAM/disk/gateway health

**2.2 — Cost Tracking UI**
- Port dashboard's cost cards and donut chart to React/Recharts
- Today's cost, all-time cost, projected monthly
- Per-model cost breakdown (7d/30d/all-time tabs)
- Cost trend line chart (SVG → Recharts)

**2.3 — Session & Sub-Agent UI**
- Active sessions with model, type badges (DM/group/cron/subagent)
- Context % bars, token counts
- Sub-agent activity grid with cost/duration/status
- Session detail panel with conversation preview

**2.4 — Cron Job Management**
- Cron job list with schedule, status, last/next run
- Run history with duration and status badges
- Trigger manual run from UI
- Add/edit/delete cron jobs (using existing gateway RPC)

**2.5 — System Health**
- Gateway status card (uptime, PID, memory, compaction)
- CPU/RAM/swap/disk gauge cards (configurable thresholds)
- Alert banner for high cost, failed crons, gateway offline
- Auto-refresh with countdown timer

**2.6 — AI Chat Panel**
- Port dashboard's AI chat to React component
- Uses OpenClaw gateway's `/v1/chat/completions` endpoint
- Context-aware: feed live monitoring data into system prompt
- Persistent chat history per user

### Phase 3: Agent Visualization (Week 3)

**3.1 — Pixel Agent Canvas**
- Port the pixel-art office scene to React (Canvas component)
- Agent sprites with activity state (working, idle, talking)
- Activity bubbles showing current task/conversation
- Conversation heat glow based on recent activity
- Day/night ambient cycle
- Pan/zoom controls (touch + mouse)

**3.2 — Real-Time Agent Events**
- Add FastAPI WebSocket endpoint (`/ws/agents`)
- Port JSONL session watcher to Python:
  - Watch `~/.openclaw/agents/*/sessions/*.jsonl`
  - Parse events (tool calls, responses, status changes)
  - Broadcast to connected WebSocket clients
- Activity ticker component (recent agent actions scrolling by)

**3.3 — Sub-Agent Spawner**
- Spawn panel integrated into the canvas view
- Click agent → "Spawn sub-agent" button
- Mini-chat for tasking the sub-agent
- Session info panel for active sub-agents
- Uses existing `agents.create` gateway RPC

**3.4 — Hardware Monitor & Service Controls**
- Server rack component (CPU/GPU/RAM/disk/network gauges)
- Breaker panel for gateway start/stop/restart
- Ham radio component for OpenClaw update checking
- All using existing gateway RPC methods (`health`, `status`, `update.run`)

### Phase 4: Integration & Polish (Week 4)

**4.1 — Navigation Integration**
- Add "Monitoring" and "Agents" sections to Mission Control sidebar
- Dashboard home page shows summary cards (cost, health, agent count)
- Deep links from monitoring → agents → pixel view

**4.2 — Theme System**
- Port the 6 dashboard themes into Mission Control's Tailwind config
- Theme picker in header (persists via localStorage)
- Glass morphism effects where appropriate

**4.3 — Alert System**
- Configurable alert rules (cost threshold, cron failure, context %, memory)
- Alert banner on every page when active
- Alert history in activity feed
- Notification delivery via webhooks or in-app

**4.4 — Data Sync Strategy**
- Primary: Gateway RPC polling (configurable intervals)
- Secondary: JSONL file watching for real-time agent events
- Tertiary: REST API for manual refresh
- WebSocket push for live updates to connected browsers
- Stale-while-revalidate caching pattern

---

## File Structure (Additions to Mission Control)

```
backend/
├── app/
│   ├── models/
│   │   ├── monitoring.py          # CostSnapshot, CronJobStatus, SessionEvent, etc.
│   │   └── alert_rules.py         # AlertRule model
│   ├── api/
│   │   ├── monitoring.py          # Cost, session, cron endpoints
│   │   ├── monitoring_system.py   # System health endpoints
│   │   └── agent_events.py        # WebSocket endpoint for agent events
│   └── services/
│       ├── monitoring/
│       │   ├── gateway_collector.py   # Polls OpenClaw gateway for data
│       │   ├── jsonl_watcher.py       # Watches session JSONL files
│       │   ├── cost_tracker.py        # Cost aggregation and projection
│       │   └── alert_engine.py        # Alert rule evaluation
│       └── openclaw/
│           └── (existing — no changes needed)
├── migrations/
│   └── versions/
│       └── xxx_add_monitoring_models.py
frontend/
├── src/
│   ├── app/
│   │   ├── monitoring/
│   │   │   ├── page.tsx              # Main monitoring dashboard
│   │   │   ├── costs/page.tsx        # Cost detail page
│   │   │   ├── sessions/page.tsx     # Session detail page
│   │   │   ├── crons/page.tsx        # Cron management page
│   │   │   └── system/page.tsx       # System health page
│   │   └── agents/
│   │       └── pixel/page.tsx        # Pixel agent canvas page
│   ├── components/
│   │   ├── monitoring/
│   │   │   ├── CostCards.tsx
│   │   │   ├── CostTrendChart.tsx
│   │   │   ├── ModelBreakdownChart.tsx
│   │   │   ├── SessionTable.tsx
│   │   │   ├── SubAgentActivity.tsx
│   │   │   ├── CronJobList.tsx
│   │   │   ├── SystemHealthCards.tsx
│   │   │   ├── AlertBanner.tsx
│   │   │   └── AiChatPanel.tsx
│   │   ├── agents/
│   │   │   ├── PixelCanvas.tsx
│   │   │   ├── AgentSprite.tsx
│   │   │   ├── ActivityBubble.tsx
│   │   │   ├── ConversationHeat.tsx
│   │   │   ├── SpawnPanel.tsx
│   │   │   ├── ServerRack.tsx
│   │   │   └── BreakerPanel.tsx
│   │   └── (existing Mission Control components)
│   └── lib/
│       ├── monitoring-api.ts         # API client for monitoring endpoints
│       └── agent-events.ts           # WebSocket client for agent events
```

---

## Key Integration Points

### Gateway Communication
All three projects talk to the OpenClaw gateway. Mission Control already has the richest integration (`gateway_rpc.py` with 60+ methods). We reuse this for everything:

| Feature | Gateway Methods Used |
|---|---|
| Cost tracking | `usage.cost`, `usage.status` |
| Session monitoring | `sessions.list`, `sessions.preview` |
| Cron management | `cron.list`, `cron.status`, `cron.add`, `cron.update`, `cron.remove`, `cron.run` |
| Agent management | `agents.list`, `agents.create`, `agents.update`, `agents.delete` |
| System health | `health`, `status`, `logs.tail` |
| Sub-agent spawning | `agents.create`, `sessions.patch` |
| Service controls | `config.get`, `config.set`, `update.run` |

### Real-Time Updates
- Dashboard uses polling (60s auto-refresh)
- Pixel agents uses WebSocket (real-time JSONL events)
- Mission Control uses TanStack Query (polling + cache invalidation)

**Our approach:** WebSocket for agent events (real-time pixel animation), TanStack Query with 30s polling for monitoring data, SSE for alerts.

### Auth
- Mission Control supports Clerk JWT and local bearer token
- Dashboard is auth-free (localhost only)
- Pixel agents uses gateway token

**Our approach:** Inherit Mission Control's auth system. Local mode for self-hosted, Clerk for multi-tenant. Monitoring and agent data scoped to organization + gateway.

---

## Dependency Summary

| Layer | Technology | Source |
|---|---|---|
| Backend framework | FastAPI + SQLModel | Mission Control |
| Database | PostgreSQL + Alembic | Mission Control |
| Job queue | Redis + RQ | Mission Control |
| Frontend framework | Next.js 16 + React 19 | Mission Control |
| UI primitives | Radix UI + Tailwind | Mission Control |
| Charts | Recharts (existing) | Mission Control |
| Pixel canvas | HTML5 Canvas (new) | Pixel Agents → React port |
| WebSocket | FastAPI WebSocket (new) | Pixel Agents → Python port |
| Auth | Clerk / local bearer token | Mission Control |
| Gateway RPC | websockets Python (existing) | Mission Control |

**No new backend languages.** Go and Node/Express are NOT added — their functionality ports to Python services within the existing FastAPI app.

---

## Risk Assessment

| Risk | Impact | Mitigation |
|---|---|---|
| Canvas rendering performance in React | Medium | Use `useRef` + `requestAnimationFrame`, not React state for animation |
| Go dashboard data collection rewritten in Python | Medium | Port logic faithfully; test against same OpenClaw data |
| JSONL file watching reliability | Medium | Use `watchdog` library + fallback polling |
| Theme system merge (6 themes × 2 systems) | Low | Map dashboard's 19 CSS vars to Tailwind config |
| Pixel assets licensing | Low | MIT licensed, attribution in ASSET-LICENSE.md |
| Gateway RPC version compatibility | Low | Already handled by protocol version negotiation in `gateway_rpc.py` |

---

## Success Metrics

1. **All monitoring features** from dashboard available in Mission Control UI
2. **Pixel agent visualization** showing real-time agent activity
3. **Single Docker Compose** brings up the entire system
4. **Single auth system** — no separate logins
5. **Single gateway connection** — reused across all features
6. **No Go or Node backend** — everything in Python/FastAPI
7. **All existing Mission Control features** still work (boards, tasks, approvals, etc.)