master-ai/AGENT_TELEMETRY_STREAMING_PROJECT.md

# Agent telemetry & live execution stream — project spec

This document captures **concrete product and engineering additions** discussed for Vibn: moving from **poll-based session updates** and **in-memory jobs** to a **durable, ordered, push-friendly execution timeline**—the web equivalent of a terminal agent’s clarity (step-by-step visibility, tool boundaries, failures, and later multi-agent signals).

---

## 1. Why this exists

### Current behavior (baseline)

| Surface | How progress reaches the user | Limits |
|--------|------------------------------|--------|
| **Agent sessions** (`agent_sessions`) | Runner `PATCH`es `output`, `status`, `changed_files` to Next; UI **polls** `GET …/agent/sessions/[id]`. | Latency, reconnect story, no single ordered stream; rich semantics encoded only in `text`. |
| **Jobs** (`/api/agent/run`, `/api/jobs/:id`) | In-memory `job-store` (`progress`, `toolCalls[]`); UI polls job endpoint. | Lost on restart; not shared across runner replicas; not unified with session UI. |
| **Orchestrator / Atlas chat** | Request/response to runner; advisor path may be remote URL. | No execution timeline for “long COO run” in-product unless you add the same event layer. |

### Product intent

- **Trust during long runs**: users see *what* happened, *when*, and *whether something was blocked*—not only a final status.
- **Differentiation**: “Ink-like” clarity in the browser—structured steps, not a blob of logs.
- **Foundation for multi-agent**: handoffs, child work, and safety events need a **common event pipe**, not ad-hoc strings.

---

## 2. Goals

1. **Append-only execution events** with **monotonic ordering** (per session or per job), suitable for replay after refresh.
2. **Server-push to the client** (recommend **SSE** first; WebSocket if you need bi-directional on the same channel).
3. **Persistence** so reconnect, refresh, and horizontal scaling do not lose history.
4. **Single conceptual model** (`AgentEvent`) usable by:
   - Build → **Agent** tab (sessions),
   - **Job** flows (create/analyze-style),
   - optionally **orchestrator** long runs later.
5. **Backward compatibility** during rollout: existing `PATCH` + `output` can remain as a fallback or be fed from the same emitter.

### Non-goals (for v1)

- Full **OpenTelemetry** export (optional later).
- **Real-time collaborative** multi-user cursors on the same session.
- Merging **claude-code-fork**—this spec is **API + UI + persistence** only.

---

## 3. Concept: `AgentEvent`

### Core shape (suggested)

```ts
type AgentEvent = {
  seq: number;           // monotonic per stream (session_id or job_id)
  ts: string;            // ISO-8601
  runId: string;         // session UUID or job id — ties events to a run
  runKind: 'session' | 'job';
  phase: 'queued' | 'running' | 'completed' | 'failed' | 'stopped';

  type: AgentEventType;
  payload: Record<string, unknown>;  // type-specific
};

type AgentEventType =
  | 'run.started'
  | 'run.phase'              // e.g. planning, executing, committing
  | 'llm.turn.start'
  | 'llm.turn.end'
  | 'tool.start'
  | 'tool.end'
  | 'tool.output'            // chunked stdout/stderr if needed
  | 'safety.block'           // policy / protected path / command denied
  | 'file.changed'           // maps to today’s changed_files semantics
  | 'git.commit'
  | 'deploy.triggered'
  | 'deploy.status'
  | 'error'
  | 'run.completed'
  | 'handoff'                // v2: parent → child agent
  | 'child_job.started'      // v2: linked run id
  ;
```

### Mapping from today’s session `outputLine`

| Today (`outputLine.type`) | Suggested event(s) |
|---------------------------|--------------------|
| `step` / `info` | `run.phase` or `llm.turn.*` with summary in `payload.message` |
| `stdout` / `stderr` | `tool.output` or dedicated stream events |
| `error` | `error` + optional `safety.block` if policy-driven |
| `done` | `run.completed` |

Keep **human-readable `message`** on events for UI defaults; add **structured fields** (`tool`, `argsSummary`, `durationMs`) for timeline rendering and filters.

---

## 4. Architecture (high level)

```mermaid
flowchart LR
  subgraph runner [vibn-agent-runner]
    RA[runSessionAgent / runAgent]
    EMIT[emitAgentEvent]
  end
  subgraph api [vibn-frontend Next.js]
    ING[POST internal ingest or PATCH extend]
    DB[(Postgres agent_events)]
    SSE[SSE GET /api/.../stream]
  end
  subgraph browser [Browser]
    UI[Timeline + live log]
  end
  RA --> EMIT
  EMIT -->|HTTPS + secret or mTLS| ING
  ING --> DB
  UI -->|EventSource| SSE
  SSE --> DB
```

**Principles**

- **Runner remains stateless** regarding “truth”: it emits events; **Next + DB** are the source of truth for the UI (matches today’s session model).
- Alternatively, runner could expose **SSE directly**—usually worse for **auth**, **CORS**, and **one domain** for the product. Prefer **Next as SSE endpoint** reading from DB.

---

## 5. Backend: `vibn-agent-runner`

### 5.1 Emit from execution paths

| Location | Action |
|----------|--------|
| `agent-session-runner.ts` | Replace or supplement `patchSession` output-only updates with **`emitAgentEvent`** each turn / tool / error. |
| `runAgent` / tool loop (`executeTool`) | Same emitter for **job** runs. |
| `server.ts` `/agent/execute` | Emit `run.started` after 202; `run.completed` / `error` on exit. |
| Security / blocked tools (`security.ts` or equivalent) | Emit `safety.block` with reason code (no secrets in payload). |

### 5.2 Transport runner → Next

**Option A (recommended):** extend existing **PATCH** or add **`POST /api/internal/agent-events`** (or per-session batch append):

- Headers: `x-agent-runner-secret` (same as today’s PATCH).
- Body: single event or small batch `{ events: AgentEvent[] }` with server-assigned `seq` to avoid races.

**Option B:** Runner writes to **Redis/Postgres** directly—couples runner to DB credentials; only do if you already run runner inside the same trust zone with DB URL.

### 5.3 Jobs store

- **Short term:** continue in-memory for job metadata; **persist events** to Postgres keyed by `jobId`.
- **Medium term:** optional **Redis** for job status + pub/sub to Next for low-latency SSE fanout (only if DB polling becomes a bottleneck).

---

## 6. Backend: `vibn-frontend` (Next.js)

### 6.1 Persistence

**New table (example): `agent_run_events`**

| Column | Notes |
|--------|--------|
| `id` | UUID |
| `run_id` | Session id or job id (text) |
| `run_kind` | `'session' \| 'job'` |
| `seq` | BIGSERIAL or per-run sequence enforced with unique constraint `(run_id, seq)` |
| `project_id` | Nullable for jobs if not scoped |
| `event` | JSONB — full `AgentEvent` or `{ type, ts, payload }` |
| `created_at` | default now() |

Index: `(run_id, seq)` for range queries (`WHERE run_id = $1 AND seq > $lastSeen`).

**Optional:** migrate legacy `agent_sessions.output` to be **derived** (last N lines for email export) or **dual-write** during transition.

### 6.2 SSE route (example contract)

- **`GET /api/projects/[projectId]/agent/sessions/[sessionId]/events/stream`**
  - Auth: session cookie / same as GET session (user must own project).
  - Query: `?afterSeq=123` for replay.
  - Response: `text/event-stream`; each message: `data: {JSON}\n\n`.
  - Heartbeat comments every ~15–30s to keep proxies alive.

For **jobs** (if not project-scoped): `GET /api/jobs/[jobId]/events/stream` with appropriate auth.

### 6.3 Ingest route (runner-only)

- **`POST /api/internal/agent-events`** (or nested under project/session as you prefer).
- Validates `x-agent-runner-secret`.
- Inserts rows with **server-generated `seq`** (transaction per run or advisory lock per `run_id`).

---

## 7. Frontend (product UI)

### 7.1 Agent tab — timeline

- **EventSource** (SSE) subscription when session is `running`; on load, **fetch historical** events (`GET …/events?afterSeq=0` or SSE from 0).
- **Timeline components**:
  - Group by `llm.turn` / `tool.start`–`tool.end`.
  - Expandable tool args (sanitized).
  - Distinct styling for `safety.block` and `error`.
- **Reconnect**: on `EventSource` error, reopen with `lastSeq` from last received event.

### 7.2 Jobs / analyze flows

- Same timeline component keyed by `jobId` if you surface those runs in UI.
- Unifies mental model: “every run has a stream.”

### 7.3 Deprecate slow polling

- Reduce `GET …/agent/sessions/[id]` poll interval when SSE connected; keep **single poll** for `status` / `changed_files` if those stay on session row only, or **also** emit `file.changed` events and drive UI from stream + one final consistency read.

---

## 8. Security & privacy

- **Never** put tokens, env values, or full file contents in events by default; use **truncation** and **hashes** where needed.
- **`safety.block`**: log reason **code** + user-safe message; align with `security.ts` behavior.
- **Rate limits** on ingest endpoint (per `run_id` / per IP) to avoid abuse if misconfigured.

---

## 9. Environment variables

| Variable | Where | Purpose |
|----------|--------|---------|
| `AGENT_RUNNER_SECRET` | Runner + Next | Ingest / extended PATCH auth |
| `VIBN_API_URL` | Runner | Base URL for callbacks |
| `AGENT_RUNNER_URL` | Next | Start runs (unchanged) |

Add if needed:

| Variable | Purpose |
|----------|---------|
| `AGENT_EVENTS_INGEST_PATH` | Optional override for ingest URL |
| `SSE_MAX_BUFFER` | Cap replay batch size |

---

## 10. Phased roadmap (suggested)

### Phase 1 — Foundation

- [ ] Define `AgentEvent` TypeScript types in a **shared package** or duplicated minimal types in runner + frontend.
- [ ] Create `agent_run_events` (or equivalent) + migration.
- [ ] Implement **ingest** endpoint; wire **runner session path** to emit core events: `run.started`, `tool.start` / `tool.end`, `error`, `run.completed`, `file.changed`.
- [ ] **Dual-write**: keep existing `PATCH` `outputLine` so nothing breaks.

### Phase 2 — Push

- [ ] SSE route + **EventSource** in Agent tab.
- [ ] Backfill UI from DB on mount; then live tail.
- [ ] Lower or gate polling on `GET` session.

### Phase 3 — Jobs + durability

- [ ] Emit same events from **job** execution path; persist by `jobId`.
- [ ] Optional: replace in-memory job list with DB for **multi-instance** runner (later).

### Phase 4 — Rich semantics

- [ ] `safety.block` from policy layer.
- [ ] `deploy.*` events if Coolify integration is user-visible.
- [ ] **Multi-agent**: `handoff`, `child_job.*` with links in payload.

---

## 11. Success metrics

- Time-to-first-visible-step after **Run** &lt; **1s** p95 (SSE).
- After hard refresh mid-run, user sees **consistent history** (no duplicate seq, no gaps if you guarantee at-least-once ingest with idempotency keys later).
- Support tickets / confusion drops on “what is the agent doing?” (qualitative).

---

## 12. Related code (repo anchors)

Use these when implementing:

- Runner session loop + PATCH bridge: `vibn-agent-runner/src/agent-session-runner.ts`
- Runner HTTP: `vibn-agent-runner/src/server.ts` (`/agent/execute`, `/agent/stop`, `/agent/approve`, `/api/agent/run`, `/api/jobs/:id`)
- In-memory jobs: `vibn-agent-runner/src/job-store.ts`
- Next session API + runner callback: `vibn-frontend/app/api/projects/[projectId]/agent/sessions/[sessionId]/route.ts`
- Session create + fire-and-forget execute: `vibn-frontend/app/api/projects/[projectId]/agent/sessions/route.ts`

---

## 13. Open decisions

1. **Single table** for sessions + jobs vs **two tables** (simpler queries vs flexibility).
2. **Seq generation**: DB sequence per `run_id` vs global monotonic with `(run_id, seq)` composite only in app logic.
3. **Idempotency**: runner retries may duplicate events—use **`event_id` UUID** from runner for dedupe on ingest.
4. **Orchestrator chat**: treat as v2 unless you need a **COO run** timeline immediately.

---

*Document version: 1.0 — aligned with discussion of runner ↔ frontend telemetry, SSE-first delivery, Postgres persistence, and future multi-agent event types.*