
# VIBN Agent Execution Architecture

## Goal

Give every product builder a personal AI that thinks like a COO — one that understands their product, monitors what's happening, and can direct specialists to get work done on their behalf. The user talks to one AI. That AI figures out what needs to happen and delegates accordingly.

---

## The Mental Model

**The user has one AI.** It happens to have specialists behind it.

The user does not manage or navigate to Code agents, Growth agents, or Analytics agents. They talk to their Assist AI — their personal product COO — and that COO routes work to the right specialist internally. The module tabs (Code, Growth, Analytics) show the *output and status* of delegated work, not separate AI interfaces the user has to navigate between.

This is the difference between a founder who has a COO and talks to one person who directs the team, and a founder who manages every department directly, which is exhausting and requires expertise they don't have.

---

## Core AI Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│  USER  (non-technical founder)                                      │
│                                                                     │
│  "I want more signups"                                              │
│  "Why did revenue drop last week?"                                  │
│  "Add a referral program"                                           │
│  "What should I build next?"                                        │
└───────────────────────────┬─────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│  ASSIST — Personal AI / COO  (Tier B: Claude Sonnet)                │
│                                                                     │
│  The user's only AI interface. Reasons across the entire            │
│  business. Surfaces insights proactively. Delegates all work        │
│  to the right specialist. Never asks the user to choose which       │
│  module or agent to use — it figures that out.                      │
│                                                                     │
│  Full context available:                                            │
│    - PRD + product vision                                           │
│    - Everything that's been built (Gitea)                           │
│    - All past agent sessions and their outcomes                     │
│    - Live analytics and deployment status                           │
│    - Persistent memory: decisions made, patterns, preferences       │
│    - Web search: competitors, market trends, growth tactics         │
│                                                                     │
│  Proactive behaviors:                                               │
│    - Surfaces anomalies before being asked                          │
│    - Tracks open questions and follows up                           │
│    - Monitors deploy outcomes and flags regressions                 │
│    - Briefs the founder on what happened while they were away       │
└──────────────┬───────────────┬───────────────┬──────────────────────┘
               │               │               │
     delegates │     delegates │     delegates │
               ▼               ▼               ▼
  ┌────────────────┐ ┌──────────────────┐ ┌────────────────────┐
  │  CODE ADVISOR  │ │  GROWTH ADVISOR  │ │ ANALYTICS ADVISOR  │
  │  (Tier B)      │ │  (Tier B)        │ │ (Tier B)           │
  │                │ │                  │ │                    │
  │ Technical      │ │ Acquisition,     │ │ Data queries,      │
  │ scoping.       │ │ activation,      │ │ anomaly detection, │
  │ Reads codebase │ │ retention.       │ │ correlates data    │
  │ before writing │ │ Researches what  │ │ with deploys and   │
  │ a single line. │ │ works for this   │ │ events.            │
  │                │ │ product type.    │ │                    │
  │ NOT user-      │ │ NOT user-        │ │ NOT user-          │
  │ facing.        │ │ facing.          │ │ facing.            │
  └───────┬────────┘ └────────┬─────────┘ └──────────┬─────────┘
          │                   │                       │
          └───────────────────┴───────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  ORCHESTRATOR  (Tier A: Gemini Flash)                               │
│                                                                     │
│  Receives a structured plan from whichever Advisor handled the      │
│  work. Breaks it into discrete tasks. Assigns tiers based on        │
│  complexity. Runs tasks in parallel where possible.                 │
└──────────┬───────────────┬───────────────┬──────────────────────────┘
           ↓               ↓               ↓
    ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
    │   TIER A    │ │   TIER B    │ │   TIER C    │
    │Gemini Flash │ │Claude Sonnet│ │Claude Sonnet│
    │             │ │             │ │  (→ Opus)   │
    │Simple edits │ │Feature work │ │Complex arch │
    │Copy, rename │ │Components   │ │Complex arch │
    │Config tweaks│ │API routes   │ │Refactors    │
    └─────────────┘ └─────────────┘ └─────────────┘
           ↓               ↓               ↓
┌─────────────────────────────────────────────────────────────────────┐
│  EXECUTION RUNTIME  (vibn-agent-runner)                             │
│                                                                     │
│  - Persistent dev environment: Node, Python, npm, git               │
│  - Agent executes commands, writes files, runs builds               │
│  - Self-corrects on errors (re-prompts with error context)          │
│  - Auto-commits to Gitea on completion                              │
│  - Triggers Coolify redeploy automatically                          │
│  - Session persists even if browser is closed                       │
└─────────────────────────────────────────────────────────────────────┘
```

**The confirmation moment is critical UX.** Before any work is delegated, Assist always surfaces a plain-language summary card:
> *"Here's my plan: I'm going to add a referral program. That means building a referral code system, an invite email, and a dashboard for your users. This will take ~8 minutes. Want me to start?"*

The user never sees task tiers, agent assignments, or technical scope. They see what's happening and confirm.
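
A minimal sketch of how such a card might be derived from an internal plan. The shapes and names here (`PlannedStep`, `InternalPlan`, `toConfirmationCard`) are hypothetical and not taken from the existing codebase. The point it illustrates is that only plain-language fields are read when the card is built, so tiers and advisor assignments stay internal.

```typescript
// Hypothetical shapes for illustration; not taken from the vibn codebase.
interface PlannedStep {
  userFacingLabel: string; // plain-language deliverable
  advisor: "code" | "growth" | "analytics"; // internal only
  tier: "A" | "B" | "C"; // internal only
}

interface InternalPlan {
  summary: string;
  steps: PlannedStep[];
  estimatedMinutes: number;
}

// Join labels as "a", "a and b", or "a, b, and c".
function joinList(items: string[]): string {
  if (items.length <= 1) return items.join("");
  if (items.length === 2) return `${items[0]} and ${items[1]}`;
  return `${items.slice(0, -1).join(", ")}, and ${items[items.length - 1]}`;
}

// Only user-facing fields are read, so tiers and advisor
// assignments cannot leak into the card text.
function toConfirmationCard(plan: InternalPlan): string {
  const deliverables = joinList(plan.steps.map((s) => s.userFacingLabel));
  return (
    `Here's my plan: ${plan.summary}. That means building ${deliverables}. ` +
    `This will take ~${plan.estimatedMinutes} minutes. Want me to start?`
  );
}
```

Keeping the card builder blind to `tier` and `advisor` is one way to enforce the "user never sees technical scope" rule structurally rather than by convention.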

---

## Assist — Personal AI / COO

The user's only AI interface. All specialists work behind it.

### What it does

**Reactive** — answers anything the user asks:
- "What should I build next?"
- "Why did signups drop?"
- "How does our product compare to [competitor]?"
- "Add a referral program"
- "Fix the checkout bug"

**Proactive** — surfaces things without being asked:
- Detects when a deploy caused an error spike
- Flags when a key metric moves significantly
- Follows up on decisions made in past conversations ("You said you wanted to revisit the onboarding flow this week")
- Delivers a brief when the founder returns after time away

### Full example

```
User: "I want more signups but I don't know what's blocking it"
  ↓
Assist COO
  - Pulls current signup funnel from analytics
  - Reads landing page + onboarding flow from Gitea
  - Checks recent deploys for regressions
  - Searches web for conversion benchmarks for this product category
  - Cross-references PRD: what was the original promise to users?
  - Asks: "Are people finding you organically or are you mostly getting
    referrals? That changes what I'd recommend."
  - User: "Mostly referrals from friends"
  - COO: "Then your acquisition isn't the problem — your activation is.
    People aren't converting after they land. Your onboarding skips the
    'aha moment'. Here's what I'd do..."
  - Presents plan: "1. Rewrite the onboarding welcome email.
    2. Add an empty-state prompt on the dashboard.
    3. Test a 30-day trial instead of 14.
    Shall I start?"
  ↓
On confirm → delegates to Growth Advisor (items 1–2) + Code Advisor (item 3)
  ↓
Orchestrator breaks into tasks → Execution agents build and deploy
  ↓
Assist follows up: "Done. The onboarding email and dashboard prompt are live.
The trial length change needs your input on pricing — want to talk through it?"
```

### What Assist does NOT do itself
- Write or execute code → delegates to Code Advisor
- Design and run growth experiments → delegates to Growth Advisor
- Run raw data queries → delegates to Analytics Advisor

The user never has to know this routing is happening.

### Persistent Memory

Assist maintains a memory of the product that grows over time:

```
"User prefers simple, fast solutions over architecturally correct ones"
"Decided against social login in Jan 2026 — too complex for current stage"
"Trial length has been discussed 3x — founder is nervous about revenue impact"
"Mobile conversion has been an open problem since Nov 2025"
```

This is what makes it feel like a COO rather than a chatbot. It remembers context across sessions so the user never has to repeat themselves.

---

## Specialist Advisors (Delegated, Not User-Facing)

These three Advisors are invoked by Assist, not by the user directly. Their module tabs (Code, Growth, Analytics) in the sidebar show **status and output** — not chat interfaces.

### Code Advisor

Handles anything that requires changes to the codebase. Scopes work technically before a single line is written.

```
[Invoked by Assist with]: "Add Stripe checkout to the admin app"
  ↓
Code Advisor
  - Reads package.json — Stripe not yet installed
  - Reads existing payment-related files (none found)
  - Searches: current Stripe Next.js best practices
  - Scopes: "3 tasks: install SDK, create checkout API route,
    add checkout button to pricing page"
  - Returns structured plan to Orchestrator
  ↓
Orchestrator:
  Task 1 (Tier A): npm install stripe
  Task 2 (Tier B): POST /api/checkout route
  Task 3 (Tier B): CheckoutButton component on pricing page
  ↓
Execution → commit → redeploy
  ↓
Assist reports back to user: "Stripe checkout is live."
```

**Context it uses:**
- Full codebase via Gitea
- Past agent sessions and what they changed
- Build logs and Coolify deploy history
- Web search: API docs, package versions, error solutions
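
The hand-off from Advisor to Orchestrator in the flow above could be typed roughly as follows. This is a sketch under assumptions: the `complexity` label, the `AdvisorPlan` shape, and `decompose` are illustrative names, and the complexity-to-tier mapping simply mirrors the tier boxes in the architecture diagram (simple edits, feature work, complex work).

```typescript
// Hypothetical plan format an Advisor might return; field names are assumptions.
type Complexity = "simple" | "feature" | "complex";
type Tier = "A" | "B" | "C";

interface PlanItem {
  description: string;
  complexity: Complexity;
}

interface AdvisorPlan {
  advisor: "code" | "growth" | "analytics";
  request: string;
  items: PlanItem[];
}

interface Task {
  index: number; // 1-based, matching "Task 1", "Task 2", ...
  tier: Tier;
  description: string;
}

// Mirrors the tier diagram: simple edits -> A, feature work -> B, complex -> C.
const TIER_FOR: Record<Complexity, Tier> = {
  simple: "A",
  feature: "B",
  complex: "C",
};

// Mechanical decomposition: one task per plan item, tier chosen by complexity.
function decompose(plan: AdvisorPlan): Task[] {
  return plan.items.map((item, i) => ({
    index: i + 1,
    tier: TIER_FOR[item.complexity],
    description: item.description,
  }));
}
```

Applied to the Stripe example, a plan with one `simple` item and two `feature` items yields tasks on Tier A, B, and B, matching the Orchestrator output shown above.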

### Growth Advisor

Handles acquisition, activation, and retention work. Researches what's proven to work before proposing anything.

```
[Invoked by Assist with]: "Build a referral program"
  ↓
Growth Advisor
  - Checks current user count + retention from analytics
  - Reviews existing onboarding flow from codebase
  - Searches: referral mechanics that work for this product type + stage
  - Proposes: "Double-sided reward — referrer gets 1 month free,
    referee gets 14-day trial. This pattern converts at ~18% for
    B2B SaaS at your stage."
  - Returns scoped plan
  ↓
Orchestrator → 4 tasks → Execution
```

**Context it uses:**
- User base metrics from Analytics
- Current user flow from Gitea
- Past growth experiments and outcomes
- Web search: growth playbooks, conversion benchmarks, competitor tactics

### Analytics Advisor

Handles all data interpretation. Correlates numbers with events to find the actual story behind a metric.

```
[Invoked by Assist with]: "Why did signups drop last week?"
  ↓
Analytics Advisor
  - Queries signup data: 30-day trend, last week specifically
  - Checks Gitea + Coolify: any deploys last week? Yes: Tuesday at 2pm
  - Checks error logs: mobile error rate went from 0.2% to 8.4% post-deploy
  - Finds: specific commit that changed signup form validation
  - Returns: "Tuesday's deploy broke mobile signup. Here's the commit."
  ↓
Assist reports to user + asks: "Want me to fix it?"
If yes → delegates to Code Advisor
```

**Context it uses:**
- Product event data (signups, activations, churns, feature usage)
- Deployment history (Gitea commits + Coolify deploy timestamps)
- Error logs and performance metrics
- Web search: category benchmarks for comparison
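
The "correlate metrics with deploys" step can be sketched as a before/after comparison around each deploy. This is an illustrative heuristic, not the Advisor's actual method: the `Deploy` and `Sample` shapes, the window size, and the `minJump` threshold are all assumptions.

```typescript
// Hypothetical inputs: deploy timestamps and error-rate samples (epoch ms).
interface Deploy {
  commit: string;
  at: number;
}

interface Sample {
  at: number;
  errorRate: number; // fraction, e.g. 0.084 for 8.4%
}

interface Suspect {
  commit: string;
  before: number; // mean error rate in the window before the deploy
  after: number; // mean error rate in the window after the deploy
}

function mean(xs: number[]): number | null {
  return xs.length === 0 ? null : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Return the deploy with the largest error-rate jump of at least minJump,
// or null when no deploy has samples on both sides showing such a jump.
function findSuspectDeploy(
  deploys: Deploy[],
  samples: Sample[],
  windowMs: number,
  minJump: number
): Suspect | null {
  let best: Suspect | null = null;
  for (const d of deploys) {
    const before = mean(
      samples.filter((s) => s.at >= d.at - windowMs && s.at < d.at).map((s) => s.errorRate)
    );
    const after = mean(
      samples.filter((s) => s.at >= d.at && s.at < d.at + windowMs).map((s) => s.errorRate)
    );
    if (before === null || after === null) continue;
    const jump = after - before;
    if (jump >= minJump && (best === null || jump > best.after - best.before)) {
      best = { commit: d.commit, before, after };
    }
  }
  return best;
}
```

In the scenario above, a deploy with a mean error rate of 0.2% before and 8.4% after would be returned as the suspect, and its commit is what the Advisor hands back to Assist.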

---

## Shared Execution Infrastructure

```
┌───────────────────────────────────────────────────────────────────┐
│  vibn-agent-runner  (Node.js/TypeScript)                          │
│                                                                   │
│  - Receives task + tier assignment from Orchestrator              │
│  - Instantiates the correct LLM client (Tier A/B/C)               │
│  - Executes tools: write_file, execute_command, read_file, etc.   │
│  - Self-corrects on errors (re-prompts with error output)         │
│  - Writes step-by-step output to Postgres after every step        │
│  - Auto-commits to Gitea on completion                            │
│  - Triggers Coolify redeploy automatically                        │
│  - Session persists even if browser tab is closed                 │
└──────────────────────┬────────────────────────────────────────────┘
                       │ git push (auto on completion)
                       ▼
┌───────────────────────────────────────────────────────────────────┐
│  Gitea  (git.vibnai.com)                                          │
│  - Source of truth for all committed code                         │
│  - Webhook → triggers Coolify auto-deploy                         │
└──────────────────────┬────────────────────────────────────────────┘
                       │ auto-deploy
                       ▼
┌───────────────────────────────────────────────────────────────────┐
│  Coolify  (coolify.vibnai.com)                                    │
│  - Builds, deploys, manages domains + SSL                         │
└───────────────────────────────────────────────────────────────────┘
```
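
The "self-corrects on errors" behavior can be sketched as a bounded retry loop that feeds the error output back into the next prompt. This is a sketch, not the runner's actual code: `RunStep`, `runWithSelfCorrection`, the prompt wording, and the default of three attempts are all assumptions.

```typescript
interface StepResult {
  ok: boolean;
  output: string; // command/build output, or the error text on failure
}

// Hypothetical signature: sends a prompt to the tier's LLM, applies the
// resulting tool calls, and reports whether the step succeeded.
type RunStep = (prompt: string) => Promise<StepResult>;

async function runWithSelfCorrection(
  task: string,
  runStep: RunStep,
  maxAttempts = 3
): Promise<{ result: StepResult; attempts: number }> {
  let prompt = task;
  let result: StepResult = { ok: false, output: "" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    result = await runStep(prompt);
    if (result.ok) return { result, attempts: attempt };
    // Re-prompt with the original task plus the error output.
    prompt =
      `${task}\n\nThe previous attempt failed with:\n${result.output}\n` +
      `Fix the error and try again.`;
  }
  // Out of attempts: return the last failure so the caller can mark the session failed.
  return { result, attempts: maxAttempts };
}
```

Bounding the loop matters because the session runs unattended: without a cap, a task that cannot succeed would keep consuming model calls after the browser is closed.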

---

## Model Tier Reference

| Tier | Model | Used For | Cost Profile |
|------|-------|----------|--------------|
| A | Gemini 2.5 Flash | Orchestration, simple tasks, routing, summaries | Very low |
| B | Claude Sonnet 4.6 | Assist COO, specialist advisors, feature-level tasks | Medium |
| C | Claude Sonnet 4.6 (→ Opus) | Hard debugging, architecture changes, complex reasoning | High |

All tiers are overridable via environment variables: `TIER_A_MODEL`, `TIER_B_MODEL`, `TIER_C_MODEL`.
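
A minimal sketch of how the override lookup might work. The default strings below are the display names from the table, used as placeholders rather than real provider model IDs, and `resolveModel` is a hypothetical helper. The environment is passed in as a plain object so the function stays testable.

```typescript
type Tier = "A" | "B" | "C";

// Placeholder defaults: display names from the tier table,
// NOT real provider model identifiers.
const DEFAULT_MODELS: Record<Tier, string> = {
  A: "Gemini 2.5 Flash",
  B: "Claude Sonnet 4.6",
  C: "Claude Sonnet 4.6",
};

type Env = Record<string, string | undefined>;

// Use TIER_<X>_MODEL when it is set and non-blank, else the default.
function resolveModel(tier: Tier, env: Env): string {
  const override = env[`TIER_${tier}_MODEL`]?.trim();
  return override ? override : DEFAULT_MODELS[tier];
}
```

In the runner this would typically be called as `resolveModel("B", process.env)`.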

---

## Data Model

### `agent_sessions` table (Postgres) — exists

```sql
CREATE TABLE agent_sessions (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id      TEXT NOT NULL,
  module          TEXT NOT NULL DEFAULT 'code',
                  -- code | growth | analytics (delegated modules)
  app_name        TEXT NOT NULL,
  app_path        TEXT NOT NULL,
  task            TEXT NOT NULL,
  plan            JSONB,
  status          TEXT NOT NULL DEFAULT 'pending',
                  -- pending | running | done | approved | failed | stopped
  output          JSONB NOT NULL DEFAULT '[]',
  changed_files   JSONB NOT NULL DEFAULT '[]',
  error           TEXT,
  started_at      TIMESTAMPTZ,
  completed_at    TIMESTAMPTZ,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

### `advisor_conversations` table (Postgres) — to be built

Stores every conversation Assist has had with the user, plus internal specialist invocations (distinguished by `module`). Each row is one conversation, which may span many turns.

```sql
CREATE TABLE advisor_conversations (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id      TEXT NOT NULL,

  -- 'assist' for user-facing COO conversations.
  -- 'code' | 'growth' | 'analytics' for internal specialist invocations.
  module          TEXT NOT NULL DEFAULT 'assist',

  messages        JSONB NOT NULL DEFAULT '[]',
  -- Full message history: [{role, content, timestamp}]

  context_snapshot JSONB,
  -- State of PRD, codebase summary, analytics at conversation start.
  -- Used for debugging ("what did it know when it said that?")

  outcome         TEXT,
  -- null | 'tasked' | 'declined' | 'deferred' | 'monitoring'

  -- If this conversation resulted in execution, link to the session(s)
  session_ids     UUID[] DEFAULT '{}',

  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

### `advisor_memory` table (Postgres) — to be built

Long-term memory for the COO. Persists facts, decisions, and patterns across conversations. This is what makes the AI feel like a real assistant rather than a stateless chatbot.

```sql
CREATE TABLE advisor_memory (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id  TEXT NOT NULL,

  -- Category of memory
  category    TEXT NOT NULL,
  -- 'decision'    — "Decided not to add social login — too complex for now"
  -- 'preference'  — "Founder prefers speed over architectural correctness"
  -- 'open'        — "Mobile conversion problem still unresolved"
  -- 'context'     — "Target user is non-technical indie founders"
  -- 'experiment'  — "Tried 30-day trial in Feb — didn't impact conversion"

  key         TEXT NOT NULL,   -- short label for retrieval
  value       TEXT NOT NULL,   -- full memory content
  confidence  REAL DEFAULT 1.0, -- 0–1, decays if contradicted

  -- Where this memory came from
  source_conversation_id UUID REFERENCES advisor_conversations(id),

  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
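
A sketch of how rows from this table might be used at conversation time. The decay rule (halving confidence on each contradiction) and the confidence floor of 0.3 are assumptions chosen for illustration; the schema only says that confidence lies in 0–1 and decays if contradicted. `contradict` and `toPromptContext` are hypothetical helpers.

```typescript
type MemoryCategory = "decision" | "preference" | "open" | "context" | "experiment";

// Mirrors the advisor_memory columns that matter for prompting.
interface MemoryItem {
  category: MemoryCategory;
  key: string;
  value: string;
  confidence: number; // 0..1
}

// Assumed decay rule: each contradiction multiplies confidence by `factor`.
function contradict(item: MemoryItem, factor = 0.5): MemoryItem {
  const next = Math.min(1, Math.max(0, item.confidence * factor));
  return { ...item, confidence: next };
}

// Render memories at or above a confidence floor as prompt context lines,
// highest confidence first. The input array is not mutated.
function toPromptContext(items: MemoryItem[], minConfidence = 0.3): string {
  return items
    .filter((m) => m.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence)
    .map((m) => `[${m.category}] ${m.value}`)
    .join("\n");
}
```

Filtering by confidence rather than deleting contradicted rows keeps the history available for debugging while preventing stale beliefs from steering new conversations.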

---

## Current End-to-End Status (March 2026)

### What works today

| Step | Status |
|------|--------|
| User types task → session created in Postgres | ✅ |
| Agent runner receives task, calls Claude Sonnet 4.6 via Vertex AI | ✅ |
| Agent executes commands and writes files in runner workspace | ✅ |
| Step output streamed to Postgres in real time | ✅ |
| Frontend polls and shows live output log | ✅ |
| Auto-commit + push to Gitea on completion | ✅ |
| Gitea webhook triggers Coolify auto-deploy | ✅ |
| Browse tab shows latest committed files | ✅ |
| Project tabs (Atlas/PRD/Build/Growth/Assist/Analytics) in sidebar | ✅ |

### In Progress

| Step | Status |
|------|--------|
| Assist COO + specialist advisor architecture | 🔶 Designing |

### Not Yet Started

| Step | Status |
|------|--------|
| `advisor_conversations` + `advisor_memory` DB tables | ⬜ |
| Assist COO (user-facing personal AI) | ⬜ |
| Code Advisor (delegated specialist) | ⬜ |
| Growth Advisor (delegated specialist) | ⬜ |
| Analytics Advisor (delegated specialist) | ⬜ |
| Orchestrator (task decomposition + tier routing) | ⬜ |
| Proactive monitoring (anomaly detection, briefings) | ⬜ |
| Parallel task execution | ⬜ |
| WebSocket streaming (replace polling) | ⬜ |
| Terminal tab (xterm.js → live container PTY) | ⬜ |

---

## Build Phases

### Phase 1 — Execution Foundation ✅
- [x] `agent_sessions` DB table + REST API routes
- [x] Agent runner with Claude Sonnet (Tier B)
- [x] Three-tier LLM clients (Gemini Flash, Claude via Vertex)
- [x] Session output streamed to Postgres per step
- [x] Auto-commit + Coolify deploy on completion
- [x] Frontend: Browse / Agent / Terminal tabs (Terminal tab only; the live PTY connection is Phase 5)
- [x] Adaptive polling, auto-select active session
- [x] Context-aware task input (locked while running)
- [x] Project tabs moved to sidebar (Atlas/PRD/Build/Growth/Assist/Analytics)

### Phase 2 — Per-project Sandboxed Workspaces
- [ ] Per-project ephemeral container (cold tier — wakes on demand)
- [ ] Agent `execute_command` routes through the project workspace container
- [ ] Persistent volume per project for caches / installed deps
- [ ] In-browser file viewer reflects live agent workspace

### Phase 3 — Assist COO + Specialist Advisors
- [ ] `advisor_conversations` + `advisor_memory` DB tables + API
- [ ] Assist COO: stateful multi-turn conversation with full project context
- [ ] Assist COO: persistent memory across conversations (advisor_memory)
- [ ] Assist COO: web search tool (competitor research, growth tactics, docs)
- [ ] Code Advisor: reads codebase, scopes tasks before execution
- [ ] Growth Advisor: researches what works, proposes specific experiments
- [ ] Analytics Advisor: queries data, correlates with deploys, finds stories
- [ ] Assist delegates to specialists internally — user sees one AI
- [ ] Confirmation card UX: plain-language plan before any execution begins

### Phase 4 — Orchestrator
- [ ] Orchestrator: decomposes advisor plans into tiered tasks
- [ ] Parallel task execution (multiple agents simultaneously)
- [ ] Task dependency graph (task B waits for task A's output)
- [ ] Cross-module task routing (one confirm → Code + Growth tasks in parallel)
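
The dependency graph and parallel execution items could be combined by grouping tasks into "waves". The following is a sketch under assumptions: `TaskNode` and `planWaves` are hypothetical names, and the real Orchestrator may schedule differently. Every task in a wave has all of its dependencies satisfied by earlier waves, so the tasks within one wave can run simultaneously.

```typescript
// Hypothetical task shape: an id plus the ids it must wait for.
interface TaskNode {
  id: string;
  dependsOn: string[];
}

// Group tasks into waves. Tasks in the same wave have no unmet
// dependencies on each other and can run in parallel.
function planWaves(tasks: TaskNode[]): string[][] {
  const done = new Set<string>();
  const waves: string[][] = [];
  let remaining = [...tasks];
  while (remaining.length > 0) {
    const ready = remaining.filter((t) => t.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      // Nothing can start: a cycle, or a dependency on an unknown task.
      throw new Error(
        "Cycle or missing dependency among: " + remaining.map((t) => t.id).join(", ")
      );
    }
    waves.push(ready.map((t) => t.id));
    for (const t of ready) done.add(t.id);
    remaining = remaining.filter((t) => !done.has(t.id));
  }
  return waves;
}
```

Each wave could then be executed with `Promise.all` over its tasks, moving to the next wave only once every task in the current one has finished.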

### Phase 5 — Streaming & Persistence
- [ ] WebSocket replaces polling for live output
- [ ] Browser reconnect: full log replay from Postgres + live tail
- [ ] Background notifications (in-app + email) on completion/failure
- [ ] Terminal tab: xterm.js connected to project workspace PTY
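
The "full log replay from Postgres + live tail" item needs a way to avoid showing a step twice when the replay and the live stream overlap. A minimal sketch, assuming each step carries a strictly increasing per-session sequence number (for example, its index in the `output` array). `StepEvent`, `LogView`, and `reconnect` are hypothetical names.

```typescript
// Assumption: seq is a per-session, strictly increasing step number.
interface StepEvent {
  seq: number;
  text: string;
}

class LogView {
  private lastSeq = 0;
  readonly lines: string[] = [];

  // Apply an event unless it was already shown. Returns true when applied.
  apply(event: StepEvent): boolean {
    if (event.seq <= this.lastSeq) return false;
    this.lines.push(event.text);
    this.lastSeq = event.seq;
    return true;
  }
}

// On reconnect: replay stored steps, then drain whatever the live stream
// buffered in the meantime. Overlapping steps are dropped by seq.
function reconnect(
  view: LogView,
  replayFromDb: StepEvent[],
  bufferedLive: StepEvent[]
): void {
  const ordered = [...replayFromDb, ...bufferedLive].sort((a, b) => a.seq - b.seq);
  for (const e of ordered) view.apply(e);
}
```

After `reconnect`, later live events go straight through `view.apply`, which keeps Postgres as the source of truth and treats the stream as a faster path to the same data.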

### Phase 6 — Proactive Intelligence
- [ ] Assist monitors for anomalies and surfaces them without being asked
- [ ] Morning briefing: digest of what happened since last session
- [ ] Memory improves over time: learns founder preferences and product patterns
- [ ] Orchestrator estimates task time + complexity before confirming
- [ ] Agents validate their own output (tests, type-checks, linting)

---

## Key Design Decisions

**Why one AI interface (the COO), not four module AIs?**
Non-technical founders don't want to manage a team of AI tools. They want to talk to one AI that knows their business and gets things done. Routing work to specialists is the COO's job, not the founder's. Hiding that complexity is the product.

**Why is Assist Tier B (Claude Sonnet) and not the fastest model?**
Assist is the only AI the user ever talks to. They'll judge the entire product by how well it understands them, how good its advice is, and how natural the conversation feels. That warrants a strong reasoning model on every turn rather than the fastest or cheapest one. The savings come from using cheaper tiers for execution, not from cutting corners on the user-facing intelligence.

**Why does Assist have persistent memory (advisor_memory)?**
Without memory, every conversation starts from zero. The COO has to re-learn that the founder doesn't want to use social login, that mobile conversion is an ongoing problem, that the trial length decision was deliberate. Memory is what transforms a chatbot into an assistant that actually knows you.

**Why are the specialist advisors not user-facing?**
Users shouldn't have to decide "is this a Code question or a Growth question?" That's the COO's job. The module tabs exist to show what's been delegated and what the status is — not as separate AI chat interfaces. The user's relationship is with Assist, not with the individual specialists.

**Why is the Orchestrator Tier A (cheap)?**
Once a specialist Advisor has produced a clear, structured plan, decomposing it into tasks is mechanical. It doesn't require deep reasoning — it requires fast, reliable routing. Gemini Flash is ideal: low latency, high quota, very low cost.

**Why auto-commit by default?**
The target user is a non-technical founder who has already confirmed the plan in plain language. Requiring a second approval before every commit creates friction and undermines the "describe it, it ships" value proposition. Gitea + Coolify already provide a rollback path if something goes wrong.

**Why store everything in Postgres?**
Browser sessions end. Postgres is the source of truth. Every conversation turn, every memory item, every execution step, every outcome is written immediately. The WebSocket stream (Phase 5) is a convenience layer on top of the database, not a replacement.