Mission-Control/PROJECT.md

396 lines
17 KiB
Markdown
Raw Normal View History

# 🎯 Mission Control — Project Plan
> Merge three OpenClaw dashboards into a single, unified Mission Control platform.
---
## Source Repos
| Repo | Purpose | Stack | Key Assets |
|---|---|---|---|
| [abhi1693/openclaw-mission-control](https://github.com/abhi1693/openclaw-mission-control) | **Base platform** — work orchestration, governance, gateway management | Python/FastAPI + PostgreSQL + Redis + Next.js (React 19) + Clerk auth + Docker Compose | Organizations, boards, tasks, tags, approvals, agents, gateways, webhooks, activity feed, skills marketplace |
| [mudrii/openclaw-dashboard](https://github.com/mudrii/openclaw-dashboard) | **Tracking layer** — real-time metrics, costs, crons, sessions, system health | Go binary (zero deps) + embedded HTML/JS + SVG charts | Cost cards, cron status, session tracking, sub-agent activity, AI chat, system metrics (CPU/RAM/disk), 6 themes, alerts, token usage |
| [jaffer1979/openclaw-pixel-agents-dashboard](https://github.com/jaffer1979/openclaw-pixel-agents-dashboard) | **Agent visualization** — pixel-art agent sprites, real-time activity | Node/Express + Vite + React 19 + Canvas/WebSocket + JSONL parsing | Agent sprites with activity bubbles, conversation heat, spawn sub-agents, hardware monitor, service controls, day/night cycle |
---
## Architecture Decision: What to Merge Into What
**Base: openclaw-mission-control** — this becomes the foundation because:
- It has the richest data model (organizations, boards, tasks, approvals, agents, gateways, webhooks)
- It has proper auth (Clerk or local bearer token)
- It has a full API layer (FastAPI with SQLModel/SQLAlchemy)
- It has multi-tenancy built in
- It has the most mature frontend (Next.js 16 + React 19 + TanStack Query + Recharts)
**Merge FROM dashboard** — extract the tracking/monitoring features:
- Cost tracking, token usage, model breakdown
- Cron job status, scheduling, last/next run
- Session tracking, sub-agent activity
- System health (CPU, RAM, disk, gateway status)
- AI chat panel (ask questions about your data)
- Alert system (high cost, failed crons, context usage)
- 6 themes + glass morphism UI
**Merge FROM pixel-agents** — extract the agent visualization:
- Pixel-art agent sprites in a shared office scene
- Real-time activity bubbles, conversation heat
- Sub-agent spawning from the UI
- Hardware monitor (CPU/GPU/RAM/disk/network)
- Service controls (start/stop/restart gateway)
- Day/night cycle ambient lighting
---
## Technical Analysis
### Base Platform (openclaw-mission-control)
**Backend:**
- Python 3.12+, FastAPI, SQLModel/SQLAlchemy, PostgreSQL, Redis
- Alembic migrations, RQ worker for webhooks
- Full OpenClaw gateway integration via WebSocket RPC (device pairing, control UI)
- Gateway methods: 60+ RPC calls for sessions, agents, cron, config, exec approvals, etc.
- Auth: Clerk JWT or local bearer token (≥50 chars)
**Frontend:**
- Next.js 16.1.7, React 19.2, TanStack Query v5, TanStack Table v8
- Radix UI primitives, Tailwind CSS, Recharts, React Markdown
- 40+ page routes (dashboard, boards, agents, approvals, gateways, skills, tags, etc.)
- Cypress E2E tests
**Data Model (27 tables):**
- Organizations, users, boards, board_groups, tasks, tags, approvals
- Agents, gateways, activity_events, board_webhooks, skills
- Custom fields, task dependencies, task fingerprints
- Board memory, board group memory, onboarding
**What it LACKS that the others have:**
- No real-time cost/token tracking
- No system health monitoring (CPU/RAM/disk)
- No cron job visualization
- No session/sub-agent activity monitoring
- No AI chat for asking about your deployment
- No pixel-art agent visualization
- No hardware monitoring
- No service controls (start/stop/restart gateway)
### Dashboard (openclaw-dashboard) — What We Pull
**Data Collection (Go):**
- `refresh.go` — main collector, reads OpenClaw filesystem + gateway API
- `refresh_sessions.go` — session listing, model resolution
- `refresh_tokens.go` — token usage tracking
- `cron_state` — cron job parsing and status
- `system.go` — CPU, RAM, swap, disk, gateway runtime probes
**API Endpoints:**
- `/api/refresh` — stale-while-revalidate data.json
- `/api/chat` — AI chat via OpenClaw gateway
- `/api/system` — live host metrics
- `/api/logs` — merged log tail
- `/api/errors` — aggregated error feed
**Frontend:**
- Pure HTML/CSS/JS (single `index.html`) — we'll rewrite as React components
- State management: 7 plain objects (State, DataLayer, DirtyChecker, Renderer, Theme, Chat, App)
- SVG chart rendering (cost trends, model breakdown, sub-agent activity)
- 6 themes with 19 CSS color variables each
**Integration Approach:**
- Port the Go data collection to Python services that hit the OpenClaw gateway API
- Replace the embedded HTML frontend with React components in the Next.js app
- Use the existing gateway RPC connection in Mission Control's backend
- Add PostgreSQL models for tracking data (cost snapshots, cron states, session events)
### Pixel Agents (openclaw-pixel-agents-dashboard) — What We Pull
**Backend (Node/Express):**
- `sessionWatcher.ts` — tails JSONL session files, parses events
- `spawner.ts` — spawns sub-agents via gateway API
- `services.ts` — gateway service controls (start/stop/restart)
- `hardware.ts` — hardware stats collection
- `openclawParser.ts` — JSONL event parsing
- WebSocket broadcasting to frontend
**Frontend (React/Vite):**
- Pixel-art canvas renderer (`OfficeCanvas.tsx`, game loop, character sprites)
- Activity bubbles, conversation heat overlays
- Spawn chat panel, session info panel
- Server rack (hardware monitor), breaker panel (service controls)
- Ham radio (update checker), fire alarm (gateway restart)
**Integration Approach:**
- Port JSONL session watcher to Python (watch OpenClaw session directory)
- Move sub-agent spawning to use Mission Control's existing gateway RPC
- Rebuild the pixel-art canvas as a React component within Next.js
- Add WebSocket support to FastAPI for real-time agent events
- Hardware stats collected via the gateway's `health` and `status` methods
---
## Implementation Plan
### Phase 1: Foundation Setup (Week 1)
**1.1 — Fork and Stand Up Base**
- Fork `abhi1693/openclaw-mission-control` to our org
- Stand up local dev environment (Docker Compose: Postgres + Redis + backend + frontend)
- Verify all existing features work: auth, boards, tasks, agents, gateways, approvals
- Document the data model and API surface
**1.2 — Add Tracking Models (Backend)**
- Create new PostgreSQL models:
- `CostSnapshot` — daily cost tracking per model/gateway
- `CronJobStatus` — cron schedule, last/next run, duration, status
- `SessionEvent` — session start/stop, model, tokens, context %
- `SubAgentRun` — sub-agent spawn, cost, duration, status
- `SystemHealthMetric` — CPU, RAM, disk, swap, gateway uptime
- `AlertRule` — configurable alert thresholds
- Create Alembic migration
- Add CRUD API endpoints under `/api/monitoring/`
**1.3 — Gateway Data Collection Service**
- Create `app/services/monitoring/gateway_collector.py`
- Reuse existing `gateway_rpc.py` to poll:
- `usage.cost` — cost data
- `usage.status` — token counts
- `cron.list` / `cron.status` — cron jobs
- `sessions.list` / `sessions.preview` — sessions
- `agents.list` — agents
- `health` — gateway health
- `status` — gateway runtime status
- Run as background task (asyncio) with configurable intervals
- Store collected data in the new models
### Phase 2: Tracking Dashboard (Week 2)
**2.1 — Monitoring Pages (Frontend)**
- New Next.js routes:
- `/monitoring` — main dashboard (cost cards, system health, alerts)
- `/monitoring/costs` — detailed cost breakdown with charts
- `/monitoring/sessions` — active sessions, sub-agent activity
- `/monitoring/crons` — cron job management
- `/monitoring/system` — CPU/RAM/disk/gateway health
**2.2 — Cost Tracking UI**
- Port dashboard's cost cards and donut chart to React/Recharts
- Today's cost, all-time cost, projected monthly
- Per-model cost breakdown (7d/30d/all-time tabs)
- Cost trend line chart (SVG → Recharts)
**2.3 — Session & Sub-Agent UI**
- Active sessions with model, type badges (DM/group/cron/subagent)
- Context % bars, token counts
- Sub-agent activity grid with cost/duration/status
- Session detail panel with conversation preview
**2.4 — Cron Job Management**
- Cron job list with schedule, status, last/next run
- Run history with duration and status badges
- Trigger manual run from UI
- Add/edit/delete cron jobs (using existing gateway RPC)
**2.5 — System Health**
- Gateway status card (uptime, PID, memory, compaction)
- CPU/RAM/swap/disk gauge cards (configurable thresholds)
- Alert banner for high cost, failed crons, gateway offline
- Auto-refresh with countdown timer
**2.6 — AI Chat Panel**
- Port dashboard's AI chat to React component
- Uses OpenClaw gateway's `/v1/chat/completions` endpoint
- Context-aware: feed live monitoring data into system prompt
- Persistent chat history per user
### Phase 3: Agent Visualization (Week 3)
**3.1 — Pixel Agent Canvas**
- Port the pixel-art office scene to React (Canvas component)
- Agent sprites with activity state (working, idle, talking)
- Activity bubbles showing current task/conversation
- Conversation heat glow based on recent activity
- Day/night ambient cycle
- Pan/zoom controls (touch + mouse)
**3.2 — Real-Time Agent Events**
- Add FastAPI WebSocket endpoint (`/ws/agents`)
- Port JSONL session watcher to Python:
- Watch `~/.openclaw/agents/*/sessions/*.jsonl`
- Parse events (tool calls, responses, status changes)
- Broadcast to connected WebSocket clients
- Activity ticker component (recent agent actions scrolling by)
**3.3 — Sub-Agent Spawner**
- Spawn panel integrated into the canvas view
- Click agent → "Spawn sub-agent" button
- Mini-chat for tasking the sub-agent
- Session info panel for active sub-agents
- Uses existing `agents.create` gateway RPC
**3.4 — Hardware Monitor & Service Controls**
- Server rack component (CPU/GPU/RAM/disk/network gauges)
- Breaker panel for gateway start/stop/restart
- Ham radio component for OpenClaw update checking
- All using existing gateway RPC methods (`health`, `status`, `update.run`)
### Phase 4: Integration & Polish (Week 4)
**4.1 — Navigation Integration**
- Add "Monitoring" and "Agents" sections to Mission Control sidebar
- Dashboard home page shows summary cards (cost, health, agent count)
- Deep links from monitoring → agents → pixel view
**4.2 — Theme System**
- Port the 6 dashboard themes into Mission Control's Tailwind config
- Theme picker in header (persists via localStorage)
- Glass morphism effects where appropriate
**4.3 — Alert System**
- Configurable alert rules (cost threshold, cron failure, context %, memory)
- Alert banner on every page when active
- Alert history in activity feed
- Notification delivery via webhooks or in-app
**4.4 — Data Sync Strategy**
- Primary: Gateway RPC polling (configurable intervals)
- Secondary: JSONL file watching for real-time agent events
- Tertiary: REST API for manual refresh
- WebSocket push for live updates to connected browsers
- Stale-while-revalidate caching pattern
---
## File Structure (Additions to Mission Control)
```
backend/
├── app/
│ ├── models/
│ │ ├── monitoring.py # CostSnapshot, CronJobStatus, SessionEvent, etc.
│ │ └── alert_rules.py # AlertRule model
│ ├── api/
│ │ ├── monitoring.py # Cost, session, cron endpoints
│ │ ├── monitoring_system.py # System health endpoints
│ │ └── agent_events.py # WebSocket endpoint for agent events
│ └── services/
│ ├── monitoring/
│ │ ├── gateway_collector.py # Polls OpenClaw gateway for data
│ │ ├── jsonl_watcher.py # Watches session JSONL files
│ │ ├── cost_tracker.py # Cost aggregation and projection
│ │ └── alert_engine.py # Alert rule evaluation
│ └── openclaw/
│ └── (existing — no changes needed)
├── migrations/
│ └── versions/
│ └── xxx_add_monitoring_models.py
frontend/
├── src/
│ ├── app/
│ │ ├── monitoring/
│ │ │ ├── page.tsx # Main monitoring dashboard
│ │ │ ├── costs/page.tsx # Cost detail page
│ │ │ ├── sessions/page.tsx # Session detail page
│ │ │ ├── crons/page.tsx # Cron management page
│ │ │ └── system/page.tsx # System health page
│ │ └── agents/
│ │ └── pixel/page.tsx # Pixel agent canvas page
│ ├── components/
│ │ ├── monitoring/
│ │ │ ├── CostCards.tsx
│ │ │ ├── CostTrendChart.tsx
│ │ │ ├── ModelBreakdownChart.tsx
│ │ │ ├── SessionTable.tsx
│ │ │ ├── SubAgentActivity.tsx
│ │ │ ├── CronJobList.tsx
│ │ │ ├── SystemHealthCards.tsx
│ │ │ ├── AlertBanner.tsx
│ │ │ └── AiChatPanel.tsx
│ │ ├── agents/
│ │ │ ├── PixelCanvas.tsx
│ │ │ ├── AgentSprite.tsx
│ │ │ ├── ActivityBubble.tsx
│ │ │ ├── ConversationHeat.tsx
│ │ │ ├── SpawnPanel.tsx
│ │ │ ├── ServerRack.tsx
│ │ │ └── BreakerPanel.tsx
│ │ └── (existing Mission Control components)
│ └── lib/
│ ├── monitoring-api.ts # API client for monitoring endpoints
│ └── agent-events.ts # WebSocket client for agent events
```
---
## Key Integration Points
### Gateway Communication
All three projects talk to the OpenClaw gateway. Mission Control already has the richest integration (`gateway_rpc.py` with 60+ methods). We reuse this for everything:
| Feature | Gateway Methods Used |
|---|---|
| Cost tracking | `usage.cost`, `usage.status` |
| Session monitoring | `sessions.list`, `sessions.preview` |
| Cron management | `cron.list`, `cron.status`, `cron.add`, `cron.update`, `cron.remove`, `cron.run` |
| Agent management | `agents.list`, `agents.create`, `agents.update`, `agents.delete` |
| System health | `health`, `status`, `logs.tail` |
| Sub-agent spawning | `agents.create`, `sessions.patch` |
| Service controls | `config.get`, `config.set`, `update.run` |
### Real-Time Updates
- Dashboard uses polling (60s auto-refresh)
- Pixel agents uses WebSocket (real-time JSONL events)
- Mission Control uses TanStack Query (polling + cache invalidation)
**Our approach:** WebSocket for agent events (real-time pixel animation), TanStack Query with 30s polling for monitoring data, SSE for alerts.
### Auth
- Mission Control supports Clerk JWT and local bearer token
- Dashboard is auth-free (localhost only)
- Pixel agents uses gateway token
**Our approach:** Inherit Mission Control's auth system. Local mode for self-hosted, Clerk for multi-tenant. Monitoring and agent data scoped to organization + gateway.
---
## Dependency Summary
| Layer | Technology | Source |
|---|---|---|
| Backend framework | FastAPI + SQLModel | Mission Control |
| Database | PostgreSQL + Alembic | Mission Control |
| Job queue | Redis + RQ | Mission Control |
| Frontend framework | Next.js 16 + React 19 | Mission Control |
| UI primitives | Radix UI + Tailwind | Mission Control |
| Charts | Recharts (existing) | Mission Control |
| Pixel canvas | HTML5 Canvas (new) | Pixel Agents → React port |
| WebSocket | FastAPI WebSocket (new) | Pixel Agents → Python port |
| Auth | Clerk / local bearer token | Mission Control |
| Gateway RPC | websockets Python (existing) | Mission Control |
**No new backend languages.** Go and Node/Express are NOT added — their functionality ports to Python services within the existing FastAPI app.
---
## Risk Assessment
| Risk | Impact | Mitigation |
|---|---|---|
| Canvas rendering performance in React | Medium | Use `useRef` + `requestAnimationFrame`, not React state for animation |
| Go dashboard data collection rewritten in Python | Medium | Port logic faithfully; test against same OpenClaw data |
| JSONL file watching reliability | Medium | Use `watchdog` library + fallback polling |
| Theme system merge (6 themes × 2 systems) | Low | Map dashboard's 19 CSS vars to Tailwind config |
| Pixel assets licensing | Low | MIT licensed, attribution in ASSET-LICENSE.md |
| Gateway RPC version compatibility | Low | Already handled by protocol version negotiation in `gateway_rpc.py` |
---
## Success Metrics
1. **All monitoring features** from dashboard available in Mission Control UI
2. **Pixel agent visualization** showing real-time agent activity
3. **Single Docker Compose** brings up the entire system
4. **Single auth system** — no separate logins
5. **Single gateway connection** — reused across all features
6. **No Go or Node backend** — everything in Python/FastAPI
7. **All existing Mission Control features** still work (boards, tasks, approvals, etc.)