# 🎯 Mission Control — Project Plan
> Merge three OpenClaw dashboards into a single, unified Mission Control platform.
---
## Source Repos
| Repo | Purpose | Stack | Key Assets |
|---|---|---|---|
| [abhi1693/openclaw-mission-control](https://github.com/abhi1693/openclaw-mission-control) | **Base platform** — work orchestration, governance, gateway management | Python/FastAPI + PostgreSQL + Redis + Next.js (React 19) + Clerk auth + Docker Compose | Organizations, boards, tasks, tags, approvals, agents, gateways, webhooks, activity feed, skills marketplace |
| [mudrii/openclaw-dashboard](https://github.com/mudrii/openclaw-dashboard) | **Tracking layer** — real-time metrics, costs, crons, sessions, system health | Go binary (zero deps) + embedded HTML/JS + SVG charts | Cost cards, cron status, session tracking, sub-agent activity, AI chat, system metrics (CPU/RAM/disk), 6 themes, alerts, token usage |
| [jaffer1979/openclaw-pixel-agents-dashboard](https://github.com/jaffer1979/openclaw-pixel-agents-dashboard) | **Agent visualization** — pixel-art agent sprites, real-time activity | Node/Express + Vite + React 19 + Canvas/WebSocket + JSONL parsing | Agent sprites with activity bubbles, conversation heat, spawn sub-agents, hardware monitor, service controls, day/night cycle |
---
## Architecture Decision: What to Merge Into What
**Base: openclaw-mission-control** — this becomes the foundation because:
- It has the richest data model (organizations, boards, tasks, approvals, agents, gateways, webhooks)
- It has proper auth (Clerk or local bearer token)
- It has a full API layer (FastAPI with SQLModel/SQLAlchemy)
- It has multi-tenancy built in
- It has the most mature frontend (Next.js 16 + React 19 + TanStack Query + Recharts)
**Merge FROM dashboard** — extract the tracking/monitoring features:
- Cost tracking, token usage, model breakdown
- Cron job status, scheduling, last/next run
- Session tracking, sub-agent activity
- System health (CPU, RAM, disk, gateway status)
- AI chat panel (ask questions about your data)
- Alert system (high cost, failed crons, context usage)
- 6 themes + glassmorphism UI
**Merge FROM pixel-agents** — extract the agent visualization:
- Pixel-art agent sprites in a shared office scene
- Real-time activity bubbles, conversation heat
- Sub-agent spawning from the UI
- Hardware monitor (CPU/GPU/RAM/disk/network)
- Service controls (start/stop/restart gateway)
- Day/night cycle ambient lighting
---
## Technical Analysis
### Base Platform (openclaw-mission-control)
**Backend:**
- Python 3.12+, FastAPI, SQLModel/SQLAlchemy, PostgreSQL, Redis
- Alembic migrations, RQ worker for webhooks
- Full OpenClaw gateway integration via WebSocket RPC (device pairing, control UI)
- Gateway methods: 60+ RPC calls for sessions, agents, cron, config, exec approvals, etc.
- Auth: Clerk JWT or local bearer token (≥50 chars)
**Frontend:**
- Next.js 16.1.7, React 19.2, TanStack Query v5, TanStack Table v8
- Radix UI primitives, Tailwind CSS, Recharts, React Markdown
- 40+ page routes (dashboard, boards, agents, approvals, gateways, skills, tags, etc.)
- Cypress E2E tests
**Data Model (27 tables):**
- Organizations, users, boards, board_groups, tasks, tags, approvals
- Agents, gateways, activity_events, board_webhooks, skills
- Custom fields, task dependencies, task fingerprints
- Board memory, board group memory, onboarding
**What it LACKS that the others have:**
- No real-time cost/token tracking
- No system health monitoring (CPU/RAM/disk)
- No cron job visualization
- No session/sub-agent activity monitoring
- No AI chat for asking about your deployment
- No pixel-art agent visualization
- No hardware monitoring
- No service controls (start/stop/restart gateway)
### Dashboard (openclaw-dashboard) — What We Pull
**Data Collection (Go):**
- `refresh.go` — main collector, reads OpenClaw filesystem + gateway API
- `refresh_sessions.go` — session listing, model resolution
- `refresh_tokens.go` — token usage tracking
- `cron_state` — cron job parsing and status
- `system.go` — CPU, RAM, swap, disk, gateway runtime probes
**API Endpoints:**
- `/api/refresh` — stale-while-revalidate data.json
- `/api/chat` — AI chat via OpenClaw gateway
- `/api/system` — live host metrics
- `/api/logs` — merged log tail
- `/api/errors` — aggregated error feed
**Frontend:**
- Pure HTML/CSS/JS (single `index.html`) — we'll rewrite as React components
- State management: 7 plain objects (State, DataLayer, DirtyChecker, Renderer, Theme, Chat, App)
- SVG chart rendering (cost trends, model breakdown, sub-agent activity)
- 6 themes with 19 CSS color variables each
**Integration Approach:**
- Port the Go data collection to Python services that hit the OpenClaw gateway API
- Replace the embedded HTML frontend with React components in the Next.js app
- Use the existing gateway RPC connection in Mission Control's backend
- Add PostgreSQL models for tracking data (cost snapshots, cron states, session events)
### Pixel Agents (openclaw-pixel-agents-dashboard) — What We Pull
**Backend (Node/Express):**
- `sessionWatcher.ts` — tails JSONL session files, parses events
- `spawner.ts` — spawns sub-agents via gateway API
- `services.ts` — gateway service controls (start/stop/restart)
- `hardware.ts` — hardware stats collection
- `openclawParser.ts` — JSONL event parsing
- WebSocket broadcasting to frontend
**Frontend (React/Vite):**
- Pixel-art canvas renderer (`OfficeCanvas.tsx`, game loop, character sprites)
- Activity bubbles, conversation heat overlays
- Spawn chat panel, session info panel
- Server rack (hardware monitor), breaker panel (service controls)
- Ham radio (update checker), fire alarm (gateway restart)
**Integration Approach:**
- Port JSONL session watcher to Python (watch OpenClaw session directory)
- Move sub-agent spawning to use Mission Control's existing gateway RPC
- Rebuild the pixel-art canvas as a React component within Next.js
- Add WebSocket support to FastAPI for real-time agent events
- Hardware stats collected via the gateway's `health` and `status` methods
---
## Implementation Plan
### Phase 1: Foundation Setup (Week 1)
**1.1 — Fork and Stand Up Base**
- Fork `abhi1693/openclaw-mission-control` to our org
- Stand up local dev environment (Docker Compose: Postgres + Redis + backend + frontend)
- Verify all existing features work: auth, boards, tasks, agents, gateways, approvals
- Document the data model and API surface
**1.2 — Add Tracking Models (Backend)**
- Create new PostgreSQL models:
  - `CostSnapshot` — daily cost tracking per model/gateway
  - `CronJobStatus` — cron schedule, last/next run, duration, status
  - `SessionEvent` — session start/stop, model, tokens, context %
  - `SubAgentRun` — sub-agent spawn, cost, duration, status
  - `SystemHealthMetric` — CPU, RAM, disk, swap, gateway uptime
  - `AlertRule` — configurable alert thresholds
- Create Alembic migration
- Add CRUD API endpoints under `/api/monitoring/`
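
As a rough sketch of what two of these records might hold, here is a plain-dataclass stand-in. Every field name below is an assumption made for illustration, not something taken from the three repos; in the actual backend these would presumably be SQLModel tables created by the Alembic migration.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from decimal import Decimal
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class CostSnapshot:
    """One day's cost for one model on one gateway (hypothetical fields)."""

    gateway_id: UUID
    model: str
    day: date
    cost_usd: Decimal  # Decimal avoids float rounding drift in money sums
    input_tokens: int = 0
    output_tokens: int = 0
    id: UUID = field(default_factory=uuid4)


@dataclass
class CronJobStatus:
    """Latest known state of one cron job (hypothetical fields)."""

    gateway_id: UUID
    job_name: str
    schedule: str  # cron expression as reported by the gateway
    last_run_at: Optional[datetime] = None
    next_run_at: Optional[datetime] = None
    last_duration_ms: Optional[int] = None
    last_status: str = "unknown"
    id: UUID = field(default_factory=uuid4)
```

Keeping `gateway_id` on every record is one way to support the organization + gateway scoping described under Auth.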
**1.3 — Gateway Data Collection Service**
- Create `app/services/monitoring/gateway_collector.py`
- Reuse existing `gateway_rpc.py` to poll:
  - `usage.cost` — cost data
  - `usage.status` — token counts
  - `cron.list` / `cron.status` — cron jobs
  - `sessions.list` / `sessions.preview` — sessions
  - `agents.list` — agents
  - `health` — gateway health
  - `status` — gateway runtime status
- Run as background task (asyncio) with configurable intervals
- Store collected data in the new models
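
A minimal sketch of the polling loop, assuming the RPC helper can be wrapped as an async callable that takes a method name. The `call` and `store` callables, the interval values, and the function names are all hypothetical; the real `gateway_rpc.py` interface may look different.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

# Hypothetical shapes: a gateway call by method name, and a persistence hook.
RpcCall = Callable[[str], Awaitable[Any]]
Store = Callable[[str, Any], Awaitable[None]]

# Seconds between polls per gateway method (illustrative values only).
DEFAULT_INTERVALS: Dict[str, float] = {
    "usage.cost": 300.0,
    "cron.list": 60.0,
    "sessions.list": 30.0,
    "health": 15.0,
}


async def poll_method(
    method: str, interval: float, call: RpcCall, store: Store, stop: asyncio.Event
) -> None:
    """Poll one gateway method until `stop` is set."""
    while not stop.is_set():
        try:
            await store(method, await call(method))
        except Exception as exc:  # one failed poll should not kill the loop
            print(f"poll {method} failed: {exc!r}")
        try:
            # Sleep for `interval`, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def run_collector(
    call: RpcCall,
    store: Store,
    stop: asyncio.Event,
    intervals: Optional[Dict[str, float]] = None,
) -> None:
    """Run one polling loop per configured method, concurrently."""
    chosen = intervals or DEFAULT_INTERVALS
    await asyncio.gather(
        *(poll_method(m, i, call, store, stop) for m, i in chosen.items())
    )
```

Giving each method its own loop lets slow-changing data such as cost be polled less often than health, and the shared `stop` event gives the app a clean shutdown path.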
### Phase 2: Tracking Dashboard (Week 2)
**2.1 — Monitoring Pages (Frontend)**
- New Next.js routes:
  - `/monitoring` — main dashboard (cost cards, system health, alerts)
  - `/monitoring/costs` — detailed cost breakdown with charts
  - `/monitoring/sessions` — active sessions, sub-agent activity
  - `/monitoring/crons` — cron job management
  - `/monitoring/system` — CPU/RAM/disk/gateway health
**2.2 — Cost Tracking UI**
- Port dashboard's cost cards and donut chart to React/Recharts
- Today's cost, all-time cost, projected monthly
- Per-model cost breakdown (7d/30d/all-time tabs)
- Cost trend line chart (SVG → Recharts)
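
The plan does not say how the projected monthly figure is computed. One simple option, shown here purely as an assumption, is linear extrapolation of the month-to-date spend; the function name is hypothetical and the dashboard's own formula may differ.

```python
import calendar
from datetime import date


def projected_monthly_cost(month_to_date_usd: float, today: date) -> float:
    """Extrapolate month-to-date spend to a full-month estimate.

    Assumes spending continues at the average daily rate seen so far.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = month_to_date_usd / today.day
    return round(daily_average * days_in_month, 2)


# $30 spent by June 10 averages $3/day, so a 30-day month projects to $90.
print(projected_monthly_cost(30.0, date(2025, 6, 10)))  # 90.0
```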
**2.3 — Session & Sub-Agent UI**
- Active sessions with model, type badges (DM/group/cron/subagent)
- Context % bars, token counts
- Sub-agent activity grid with cost/duration/status
- Session detail panel with conversation preview
**2.4 — Cron Job Management**
- Cron job list with schedule, status, last/next run
- Run history with duration and status badges
- Trigger manual run from UI
- Add/edit/delete cron jobs (using existing gateway RPC)
**2.5 — System Health**
- Gateway status card (uptime, PID, memory, compaction)
- CPU/RAM/swap/disk gauge cards (configurable thresholds)
- Alert banner for high cost, failed crons, gateway offline
- Auto-refresh with countdown timer
**2.6 — AI Chat Panel**
- Port dashboard's AI chat to React component
- Uses OpenClaw gateway's `/v1/chat/completions` endpoint
- Context-aware: feed live monitoring data into system prompt
- Persistent chat history per user
### Phase 3: Agent Visualization (Week 3)
**3.1 — Pixel Agent Canvas**
- Port the pixel-art office scene to React (Canvas component)
- Agent sprites with activity state (working, idle, talking)
- Activity bubbles showing current task/conversation
- Conversation heat glow based on recent activity
- Day/night ambient cycle
- Pan/zoom controls (touch + mouse)
**3.2 — Real-Time Agent Events**
- Add FastAPI WebSocket endpoint (`/ws/agents`)
- Port JSONL session watcher to Python:
  - Watch `~/.openclaw/agents/*/sessions/*.jsonl`
  - Parse events (tool calls, responses, status changes)
  - Broadcast to connected WebSocket clients
- Activity ticker component (recent agent actions scrolling by)
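
The core of the watcher could be sketched as an offset-tracking tailer like the one below. This is a polling-only illustration, not a port of `sessionWatcher.ts`; the class name is hypothetical, and the event shape is whatever the JSONL lines happen to contain.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class JsonlTailer:
    """Returns only the complete lines appended since the previous poll."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.offset = 0  # byte offset of the first unread line

    def read_new_events(self) -> List[Dict[str, Any]]:
        events: List[Dict[str, Any]] = []
        if not self.path.exists():
            return events
        if self.path.stat().st_size < self.offset:
            self.offset = 0  # file was truncated or replaced; start over
        with self.path.open("rb") as fh:
            fh.seek(self.offset)
            while True:
                line = fh.readline()
                if not line.endswith(b"\n"):
                    break  # EOF, or a partial line still being written
                self.offset += len(line)
                try:
                    event = json.loads(line)
                except ValueError:
                    continue  # skip malformed or non-UTF-8 lines
                if isinstance(event, dict):
                    events.append(event)
        return events
```

Leaving a partial trailing line unread until its newline arrives avoids parsing half-written events. A file-system notification library could trigger `read_new_events()` instead of a timer, with the timer kept as the fallback.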
**3.3 — Sub-Agent Spawner**
- Spawn panel integrated into the canvas view
- Click agent → "Spawn sub-agent" button
- Mini-chat for tasking the sub-agent
- Session info panel for active sub-agents
- Uses existing `agents.create` gateway RPC
**3.4 — Hardware Monitor & Service Controls**
- Server rack component (CPU/GPU/RAM/disk/network gauges)
- Breaker panel for gateway start/stop/restart
- Ham radio component for OpenClaw update checking
- All using existing gateway RPC methods (`health`, `status`, `update.run`)
### Phase 4: Integration & Polish (Week 4)
**4.1 — Navigation Integration**
- Add "Monitoring" and "Agents" sections to Mission Control sidebar
- Dashboard home page shows summary cards (cost, health, agent count)
- Deep links from monitoring → agents → pixel view
**4.2 — Theme System**
- Port the 6 dashboard themes into Mission Control's Tailwind config
- Theme picker in header (persists via localStorage)
- Glassmorphism effects where appropriate
**4.3 — Alert System**
- Configurable alert rules (cost threshold, cron failure, context %, memory)
- Alert banner on every page when active
- Alert history in activity feed
- Notification delivery via webhooks or in-app
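
Rule evaluation could be as small as the sketch below. The rule fields, the metric names, and the comparator set are assumptions for illustration; the plan only specifies that thresholds are configurable.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Mapping

_COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass(frozen=True)
class AlertRule:
    """A single threshold check (hypothetical fields)."""

    name: str
    metric: str  # key into the metrics mapping, e.g. "cost_today_usd"
    comparator: str  # one of the keys in _COMPARATORS
    threshold: float


def evaluate_rules(
    rules: Iterable[AlertRule], metrics: Mapping[str, float]
) -> List[str]:
    """Return the names of rules whose condition currently holds."""
    fired: List[str] = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not collected yet; do not alert on missing data
        if _COMPARATORS[rule.comparator](value, rule.threshold):
            fired.append(rule.name)
    return fired


rules = [
    AlertRule("high-cost", "cost_today_usd", ">", 10.0),
    AlertRule("cron-failures", "failed_crons", ">=", 1),
]
print(evaluate_rules(rules, {"cost_today_usd": 12.5, "failed_crons": 0}))
# ['high-cost']
```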
**4.4 — Data Sync Strategy**
- Primary: Gateway RPC polling (configurable intervals)
- Secondary: JSONL file watching for real-time agent events
- Tertiary: REST API for manual refresh
- WebSocket push for live updates to connected browsers
- Stale-while-revalidate caching pattern
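
The stale-while-revalidate pattern can be sketched in a few lines of asyncio: serve the cached value immediately, and refresh in the background once it is older than `max_age`. The class and parameter names are hypothetical, and this is not how the Go dashboard's `/api/refresh` is actually implemented.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class StaleWhileRevalidate:
    """Serve the cached value at once; refresh in the background when stale."""

    def __init__(
        self,
        fetch: Callable[[], Awaitable[Any]],
        max_age: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock
        self._value: Any = None
        self._fetched_at: Optional[float] = None
        self._refresh: Optional[asyncio.Task] = None

    async def get(self) -> Any:
        if self._fetched_at is None:
            # Nothing cached yet, so the first caller has to wait.
            await self._refresh_now()
        elif self._clock() - self._fetched_at > self._max_age:
            # Stale: start at most one background refresh, do not wait for it.
            if self._refresh is None or self._refresh.done():
                self._refresh = asyncio.create_task(self._refresh_now())
        return self._value

    async def _refresh_now(self) -> None:
        self._value = await self._fetch()
        self._fetched_at = self._clock()
```

Callers never block on a slow gateway after the first fetch, at the cost of occasionally seeing data up to one refresh cycle old.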
---
## File Structure (Additions to Mission Control)
```
backend/
├── app/
│   ├── models/
│   │   ├── monitoring.py            # CostSnapshot, CronJobStatus, SessionEvent, etc.
│   │   └── alert_rules.py           # AlertRule model
│   ├── api/
│   │   ├── monitoring.py            # Cost, session, cron endpoints
│   │   ├── monitoring_system.py     # System health endpoints
│   │   └── agent_events.py          # WebSocket endpoint for agent events
│   └── services/
│       ├── monitoring/
│       │   ├── gateway_collector.py # Polls OpenClaw gateway for data
│       │   ├── jsonl_watcher.py     # Watches session JSONL files
│       │   ├── cost_tracker.py      # Cost aggregation and projection
│       │   └── alert_engine.py      # Alert rule evaluation
│       └── openclaw/
│           └── (existing — no changes needed)
├── migrations/
│   └── versions/
│       └── xxx_add_monitoring_models.py
frontend/
├── src/
│   ├── app/
│   │   ├── monitoring/
│   │   │   ├── page.tsx             # Main monitoring dashboard
│   │   │   ├── costs/page.tsx       # Cost detail page
│   │   │   ├── sessions/page.tsx    # Session detail page
│   │   │   ├── crons/page.tsx       # Cron management page
│   │   │   └── system/page.tsx      # System health page
│   │   └── agents/
│   │       └── pixel/page.tsx       # Pixel agent canvas page
│   ├── components/
│   │   ├── monitoring/
│   │   │   ├── CostCards.tsx
│   │   │   ├── CostTrendChart.tsx
│   │   │   ├── ModelBreakdownChart.tsx
│   │   │   ├── SessionTable.tsx
│   │   │   ├── SubAgentActivity.tsx
│   │   │   ├── CronJobList.tsx
│   │   │   ├── SystemHealthCards.tsx
│   │   │   ├── AlertBanner.tsx
│   │   │   └── AiChatPanel.tsx
│   │   ├── agents/
│   │   │   ├── PixelCanvas.tsx
│   │   │   ├── AgentSprite.tsx
│   │   │   ├── ActivityBubble.tsx
│   │   │   ├── ConversationHeat.tsx
│   │   │   ├── SpawnPanel.tsx
│   │   │   ├── ServerRack.tsx
│   │   │   └── BreakerPanel.tsx
│   │   └── (existing Mission Control components)
│   └── lib/
│       ├── monitoring-api.ts        # API client for monitoring endpoints
│       └── agent-events.ts          # WebSocket client for agent events
```
---
## Key Integration Points
### Gateway Communication
All three projects talk to the OpenClaw gateway. Mission Control already has the richest integration (`gateway_rpc.py` with 60+ methods). We reuse this for everything:
| Feature | Gateway Methods Used |
|---|---|
| Cost tracking | `usage.cost`, `usage.status` |
| Session monitoring | `sessions.list`, `sessions.preview` |
| Cron management | `cron.list`, `cron.status`, `cron.add`, `cron.update`, `cron.remove`, `cron.run` |
| Agent management | `agents.list`, `agents.create`, `agents.update`, `agents.delete` |
| System health | `health`, `status`, `logs.tail` |
| Sub-agent spawning | `agents.create`, `sessions.patch` |
| Service controls | `config.get`, `config.set`, `update.run` |
### Real-Time Updates
- Dashboard uses polling (60s auto-refresh)
- Pixel agents uses WebSocket (real-time JSONL events)
- Mission Control uses TanStack Query (polling + cache invalidation)
**Our approach:** WebSocket for agent events (real-time pixel animation), TanStack Query with 30s polling for monitoring data, SSE for alerts.
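
For the SSE part, each alert would be written to the response stream as a text frame in the standard `text/event-stream` format. A minimal formatter might look like this; the function name, event name, and payload shape are assumptions.

```python
import json
from typing import Any, Mapping, Optional


def format_sse(
    data: Mapping[str, Any],
    event: Optional[str] = None,
    event_id: Optional[str] = None,
) -> str:
    """Encode one Server-Sent Events frame with a JSON payload."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines by default, so the payload fits one data line.
    lines.append(f"data: {json.dumps(dict(data))}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


print(format_sse({"rule": "high-cost"}, event="alert"), end="")
# event: alert
# data: {"rule": "high-cost"}
```

In the browser, an `EventSource` listener for the `alert` event would receive the JSON string and could drive the alert banner.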
### Auth
- Mission Control supports Clerk JWT and local bearer token
- Dashboard is auth-free (localhost only)
- Pixel agents uses gateway token
**Our approach:** Inherit Mission Control's auth system. Local mode for self-hosted, Clerk for multi-tenant. Monitoring and agent data scoped to organization + gateway.
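
For local mode, a bearer-token check along the following lines would be enough for the new monitoring endpoints. This is an illustrative sketch, not Mission Control's actual implementation: the function name is hypothetical, and only the 50-character minimum comes from the analysis above.

```python
import hmac

MIN_TOKEN_LENGTH = 50  # local bearer tokens must be at least 50 characters


def is_valid_local_token(authorization_header: str, expected_token: str) -> bool:
    """Check an `Authorization: Bearer <token>` header against the configured token."""
    if len(expected_token) < MIN_TOKEN_LENGTH:
        return False  # refuse to run with a token that is too short
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

New routes and the WebSocket endpoint should reuse the existing auth dependency rather than a parallel check like this one, so that Clerk mode keeps working unchanged.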
---
## Dependency Summary
| Layer | Technology | Source |
|---|---|---|
| Backend framework | FastAPI + SQLModel | Mission Control |
| Database | PostgreSQL + Alembic | Mission Control |
| Job queue | Redis + RQ | Mission Control |
| Frontend framework | Next.js 16 + React 19 | Mission Control |
| UI primitives | Radix UI + Tailwind | Mission Control |
| Charts | Recharts (existing) | Mission Control |
| Pixel canvas | HTML5 Canvas (new) | Pixel Agents → React port |
| WebSocket | FastAPI WebSocket (new) | Pixel Agents → Python port |
| Auth | Clerk / local bearer token | Mission Control |
| Gateway RPC | websockets Python (existing) | Mission Control |
**No new backend languages.** Go and Node/Express are NOT added — their functionality ports to Python services within the existing FastAPI app.
---
## Risk Assessment
| Risk | Impact | Mitigation |
|---|---|---|
| Canvas rendering performance in React | Medium | Use `useRef` + `requestAnimationFrame`, not React state for animation |
| Behavior drift when porting the Go dashboard's data collection to Python | Medium | Port logic faithfully; test against the same OpenClaw data |
| JSONL file watching reliability | Medium | Use `watchdog` library + fallback polling |
| Theme system merge (6 themes × 2 systems) | Low | Map dashboard's 19 CSS vars to Tailwind config |
| Pixel assets licensing | Low | MIT licensed, attribution in ASSET-LICENSE.md |
| Gateway RPC version compatibility | Low | Already handled by protocol version negotiation in `gateway_rpc.py` |
---
## Success Metrics
1. **All monitoring features** from dashboard available in Mission Control UI
2. **Pixel agent visualization** showing real-time agent activity
3. **Single Docker Compose** brings up the entire system
4. **Single auth system** — no separate logins
5. **Single gateway connection** — reused across all features
6. **No Go or Node backend** — everything in Python/FastAPI
7. **All existing Mission Control features** still work (boards, tasks, approvals, etc.)