vibn-frontend/COLLECTOR_EXTRACTOR_REFACTOR.md

# Collector & Extractor Refactor - Complete

## Overview

Refactored the Collector and Extraction Review phases to implement a proactive, collaborative workflow that guides users through setup and chunks only the content they confirm is important.

---

## Changes Made

### 1. **Collector Phase (v2 Prompt)**

**Location:** `lib/ai/prompts/collector.ts`

**New Behavior:**
- ✅ **Proactive Welcome** - Greets new users with clear 3-step setup guide
- ✅ **3-Step Checklist Tracking:**
  1. Upload documents 📄
  2. Connect GitHub repo 🔗
  3. Install browser extension 🔌
- ✅ **Smart GitHub Analysis** - Automatically analyzes connected repos and presents findings
- ✅ **Conversational Handoff** - Asks "Is that everything?" when materials are detected
- ✅ **Automatic Transition** - Moves to `extraction_review_mode` when user confirms

**Key Changes:**
- Removed "Click Analyze Context button" instruction
- Added explicit checklist tracking based on `knowledgeSummary.bySourceType`
- Added welcome message with step-by-step guidance
- Emphasized asking ONE question at a time so users are not overwhelmed
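
The checklist tracking described above could look roughly like the following sketch. It assumes `knowledgeSummary.bySourceType` maps a source-type name to an item count; the `'document'` and `'github'` key names and both helper names are hypothetical, and only the `'extension'` key and the checklist field names appear elsewhere in this document.

```typescript
// Assumed shape: source-type name -> number of knowledge items of that type.
type BySourceType = Record<string, number | undefined>;

interface CollectorChecklist {
  hasDocuments: boolean;
  documentCount: number;
  githubConnected: boolean;
  extensionLinked: boolean;
}

// Hypothetical helper: derive the 3-step checklist from the summary counts.
function buildChecklist(bySourceType: BySourceType): CollectorChecklist {
  const documentCount = bySourceType["document"] ?? 0; // key name is an assumption
  return {
    hasDocuments: documentCount > 0,
    documentCount,
    githubConnected: (bySourceType["github"] ?? 0) > 0, // key name is an assumption
    extensionLinked: (bySourceType["extension"] ?? 0) > 0,
  };
}

// Hypothetical helper: return only the first incomplete step,
// so the prompt can ask about ONE thing at a time.
function nextMissingStep(checklist: CollectorChecklist): string | null {
  if (!checklist.hasDocuments) return "Upload documents";
  if (!checklist.githubConnected) return "Connect GitHub repo";
  if (!checklist.extensionLinked) return "Install browser extension";
  return null; // all three steps are done
}
```

Returning a single step rather than the full list of gaps is one way to keep the prompt from asking several questions at once.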

---

### 2. **Extraction Review Phase (v2 Prompt)**

**Location:** `lib/ai/prompts/extraction-review.ts`

**New Behavior:**
- ✅ **Collaborative Review** - Presents each potential insight and asks "Is this important?"
- ✅ **Smart Chunking** - Only chunks content the user confirms is V1-critical
- ✅ **Semantic Boundaries** - Chunks by meaning (feature, persona, constraint), not character count
- ✅ **Tight Responses** - Keeps replies short and guides the review instead of writing essays

**Workflow:**
1. **Read & Identify** - Find potential insights in documents/code
2. **Collaborative Review** - Show user the text, ask "Should I save this?"
3. **Chunk & Store** - Extract and store confirmed insights in AlloyDB
4. **Build Product Model** - Synthesize confirmed insights into `canonicalProductModel`

**Key Changes:**
- Removed automatic extraction behavior
- Added explicit "Is this important?" questioning pattern
- Emphasized showing ACTUAL TEXT from user's docs
- Added chunking strategy guidance (semantic, not arbitrary)
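
The confirm-then-store part of this workflow can be sketched as a simple loop. This is an illustration under assumptions, not the prompt's actual mechanism: an in-memory array stands in for AlloyDB, and `CandidateInsight`, `StoredChunk`, `reviewInsights`, and the `confirm` callback are hypothetical names. The chunk fields (`knowledgeItemId`, `content`, `metadata`) follow the fields listed for the planned `chunk-insight` endpoint under Next Steps.

```typescript
// Semantic categories named in this document: feature, persona, constraint.
type InsightKind = "feature" | "persona" | "constraint";

interface CandidateInsight {
  knowledgeItemId: string; // the whole document the quote came from
  kind: InsightKind;
  quote: string; // ACTUAL TEXT from the user's document, shown during review
}

interface StoredChunk {
  knowledgeItemId: string;
  content: string;
  metadata: { kind: InsightKind };
}

// Hypothetical helper: store a chunk only when the user confirms it.
// Returns the number of chunks saved.
function reviewInsights(
  candidates: CandidateInsight[],
  confirm: (candidate: CandidateInsight) => boolean, // stands in for the user's yes/no
  store: StoredChunk[], // stands in for AlloyDB
): number {
  let saved = 0;
  for (const candidate of candidates) {
    if (!confirm(candidate)) continue; // "no" -> skip, nothing is stored
    store.push({
      knowledgeItemId: candidate.knowledgeItemId,
      content: candidate.quote,
      metadata: { kind: candidate.kind },
    });
    saved++;
  }
  return saved;
}
```

Each chunk corresponds to one confirmed insight, so chunk boundaries follow meaning rather than a fixed character count.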

---

### 3. **UI Changes**

**Location:** `app/[workspace]/project/[projectId]/v_ai_chat/page.tsx`

**Changes:**
- ❌ Removed "Analyze Context" button
- ❌ Removed `isBatchExtracting` state
- ❌ Removed `handleBatchExtract` function
- ❌ Removed `Sparkles` icon import
- ✅ Kept "Reset Chat" button

**Rationale:**
- Transition to extraction happens conversationally ("Is that everything?" → "yes" → auto-transition)
- No manual button click needed
- Cleaner, less cluttered UI

---

### 4. **Auto-Chunking Disabled**

**Location:** `app/api/projects/[projectId]/knowledge/upload-document/route.ts`

**Changes:**
- ✅ Commented out `writeKnowledgeChunksForItem` fire-and-forget call
- ✅ Added comment: `// NOTE: Auto-chunking disabled - Extractor AI will collaboratively chunk important sections`

**Rationale:**
- Documents are stored whole in Firestore as `knowledge_items`
- Extractor AI reads them later and chunks only user-confirmed insights
- Prevents bloat in AlloyDB with irrelevant chunks

---

### 5. **PhaseHandoff Type Updates**

**Location:** `lib/types/phase-handoff.ts`

**Changes:**
- ✅ Added `'collector'` to `PhaseType` union
- ✅ Created `CollectorPhaseHandoff` interface with checklist fields:
  ```typescript
  confirmed: {
    hasDocuments?: boolean;
    documentCount?: number;
    githubConnected?: boolean;
    githubRepo?: string;
    extensionLinked?: boolean;
  }
  uncertain: {
    extensionDeclined?: boolean;
    noGithubYet?: boolean;
  }
  missing: string[];
  ```
- ✅ Added `CollectorPhaseHandoff` to `AnyPhaseHandoff` union

**Location:** `lib/types/project-artifacts.ts`

**Changes:**
- ✅ Updated `phaseHandoffs` to include `'collector'` key
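
To illustrate how the checklist fields fit together, here is a minimal sketch that fills them in from the collected setup state. Only the `confirmed`, `uncertain`, and `missing` fields come from this document; `CollectorHandoffFields`, `CollectorInputs`, and `buildCollectorHandoff` are hypothetical names, the strings pushed into `missing` are illustrative, and the real `CollectorPhaseHandoff` may carry further base fields that are not shown here.

```typescript
// Checklist portion of the handoff, as listed in this document.
interface CollectorHandoffFields {
  confirmed: {
    hasDocuments?: boolean;
    documentCount?: number;
    githubConnected?: boolean;
    githubRepo?: string;
    extensionLinked?: boolean;
  };
  uncertain: {
    extensionDeclined?: boolean;
    noGithubYet?: boolean;
  };
  missing: string[];
}

// Hypothetical input: what the Collector knows at handoff time.
interface CollectorInputs {
  documentCount: number;
  githubRepo?: string;
  extensionLinked: boolean;
  extensionDeclined?: boolean;
}

function buildCollectorHandoff(input: CollectorInputs): CollectorHandoffFields {
  const missing: string[] = [];
  if (input.documentCount === 0) missing.push("documents");
  if (!input.githubRepo) missing.push("github");
  // A declined extension is recorded as uncertain, not as missing (an assumption).
  if (!input.extensionLinked && !input.extensionDeclined) missing.push("extension");

  return {
    confirmed: {
      hasDocuments: input.documentCount > 0,
      documentCount: input.documentCount,
      githubConnected: Boolean(input.githubRepo),
      githubRepo: input.githubRepo,
      extensionLinked: input.extensionLinked,
    },
    uncertain: {
      extensionDeclined: input.extensionDeclined ?? false,
      noGithubYet: !input.githubRepo,
    },
    missing,
  };
}
```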

---

## How It Works Now

### **User Journey:**

1. **Welcome (Collector)**
   - AI greets user: "Welcome to Vibn! Here's how this works: Step 1: Upload docs, Step 2: Connect GitHub, Step 3: Install extension"
   - User uploads documents via Context tab → AI confirms: "✅ I see you've uploaded 2 document(s)"
   - User connects GitHub → AI analyzes and presents: "✅ I can see your repo - it's built with Next.js, has 247 files..."
   - User installs extension → AI confirms: "✅ I see your browser extension is connected"

2. **Handoff Question (Collector)**
   - AI asks: "Is that everything you want me to work with for now? If so, I'll start digging into the details."
   - User says: "yes" / "yep" / "go ahead"

3. **Automatic Transition**
   - AI responds: "Perfect! Let me analyze what you've shared. This might take a moment..."
   - System automatically transitions to `extraction_review_mode`

4. **Collaborative Extraction (Extractor)**
   - AI says: "I'm reading through everything you've shared. Let me walk through what I found..."
   - AI presents each insight: "I found this section about [topic]: [quote]. Is this important for your V1 product? Should I save it?"
   - User says: "yes" → AI chunks and stores: "✅ Saved! I'll remember this for later phases."
   - User says: "no" → AI skips: "Got it, moving on..."

5. **Product Model Built**
   - After reviewing all docs, AI asks: "I've identified 12 key requirements. Does that sound right?"
   - AI synthesizes `canonicalProductModel` and transitions to Vision phase
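
In the flow above, the model itself interprets the user's answer to the handoff question. For tests or a deterministic fallback, a matcher along these lines could recognize the example replies from step 2. This is only a sketch: `isHandoffConfirmation` and the phrase list are hypothetical and would need to be broader in practice.

```typescript
// The three example replies quoted in the user journey.
const AFFIRMATIVE_REPLIES = new Set(["yes", "yep", "go ahead"]);

// Hypothetical helper: true when the reply is one of the known confirmations,
// ignoring case, surrounding whitespace, and trailing "." or "!".
function isHandoffConfirmation(reply: string): boolean {
  const normalized = reply.trim().toLowerCase().replace(/[.!]+$/, "");
  return AFFIRMATIVE_REPLIES.has(normalized);
}
```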

---

## Extension Project Linking

**Current Status:**
- Extension uses its `workspacePath` value to identify project context
- Extension sends chats to the Vibn proxy with that value in the `x-workspace-path` header
- Vibn API uses `extractProjectName(workspacePath)` to link chats to projects
- **Limitation:** Extension doesn't explicitly link to a Vibn project ID yet

**Detection in Collector:**
- Checks `knowledgeSummary.bySourceType` for an `'extension'` entry, or `contextSources` for an item with `type='extension'`
- If found: "✅ I see your browser extension is connected"
- If not: "Have you installed the Vibn browser extension yet?"
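
The detection rule can be sketched as a small predicate. The data shapes are assumptions (`bySourceType` as a count map, `contextSources` as an array of objects with a `type` field), and `isExtensionLinked` and `extensionMessage` are hypothetical names; the two messages are the ones quoted above.

```typescript
interface ContextSource {
  type: string;
}

// Assumed shapes for the two places the Collector looks.
interface DetectionInput {
  bySourceType?: Record<string, number | undefined>;
  contextSources?: ContextSource[];
}

// True when either source shows evidence of the extension.
function isExtensionLinked(input: DetectionInput): boolean {
  const fromSummary = (input.bySourceType?.["extension"] ?? 0) > 0;
  const fromSources = (input.contextSources ?? []).some((s) => s.type === "extension");
  return fromSummary || fromSources;
}

// Pick the confirmation or the follow-up question accordingly.
function extensionMessage(input: DetectionInput): string {
  return isExtensionLinked(input)
    ? "✅ I see your browser extension is connected"
    : "Have you installed the Vibn browser extension yet?";
}
```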

**Future Enhancement:**
- Add explicit project ID linking in extension settings
- Allow users to select which Vibn project their workspace maps to
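
One possible shape for the explicit linking is to prefer a project ID header and fall back to the current workspace-path behavior. The sketch below is an assumption throughout: `resolveProject` and `ProjectRef` are hypothetical, headers are modeled as a plain lowercase-keyed record, and `extractProjectNameSketch` is a stand-in that takes the last path segment, since the real `extractProjectName` is not shown in this document. The `x-vibn-project-id` header name is the one proposed under Next Steps.

```typescript
type RequestHeaders = Record<string, string | undefined>;

// Stand-in for the real extractProjectName (behavior assumed: last path segment).
function extractProjectNameSketch(workspacePath: string): string {
  const parts = workspacePath.split(/[\\\/]/).filter((p) => p.length > 0);
  return parts.pop() ?? "";
}

type ProjectRef =
  | { kind: "id"; projectId: string }
  | { kind: "name"; projectName: string }
  | { kind: "none" };

// Prefer the explicit project ID; fall back to the workspace path.
function resolveProject(headers: RequestHeaders): ProjectRef {
  const projectId = headers["x-vibn-project-id"]?.trim();
  if (projectId) return { kind: "id", projectId };

  const workspacePath = headers["x-workspace-path"]?.trim();
  if (workspacePath) {
    const projectName = extractProjectNameSketch(workspacePath);
    if (projectName) return { kind: "name", projectName };
  }
  return { kind: "none" };
}
```

Keeping the fallback would let older extension versions that send only `x-workspace-path` continue to work.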

---

## Files Changed

1. `lib/ai/prompts/collector.ts` - New v2 prompt (proactive, 3-step checklist)
2. `lib/ai/prompts/extraction-review.ts` - New v2 prompt (collaborative chunking)
3. `app/[workspace]/project/[projectId]/v_ai_chat/page.tsx` - Removed "Analyze Context" button
4. `app/api/projects/[projectId]/knowledge/upload-document/route.ts` - Disabled auto-chunking
5. `lib/types/phase-handoff.ts` - Added `CollectorPhaseHandoff` type
6. `lib/types/project-artifacts.ts` - Updated `phaseHandoffs` to include `'collector'`

---

## Testing Checklist

### **Collector Phase:**
- [ ] New project shows welcome message with 3-step guide
- [ ] Uploading doc triggers "✅ I see you've uploaded X document(s)"
- [ ] Connecting GitHub triggers repo analysis summary
- [ ] AI asks "Is that everything?" when materials exist
- [ ] User saying "yes" transitions to `extraction_review_mode`

### **Extraction Phase:**
- [ ] AI presents insights one at a time
- [ ] AI shows actual text from user's docs
- [ ] User saying "yes" to insight triggers "✅ Saved!"
- [ ] User saying "no" to insight triggers skip
- [ ] After review, AI asks "I've identified X requirements. Does that sound right?"
- [ ] Confirmed insights are chunked and stored in AlloyDB

### **Upload Flow:**
- [ ] Uploading document does NOT trigger auto-chunking
- [ ] Document is stored whole in Firestore
- [ ] Document appears in Context UI
- [ ] Extractor can read full document content later

---

## Next Steps

1. **Implement Extraction Chunking API**
   - Create endpoint for AI to chunk and store confirmed insights
   - `/api/projects/[projectId]/knowledge/chunk-insight`
   - Takes `knowledgeItemId`, `content`, `metadata` (importance, tags, etc.)

2. **Add CollectorPhaseHandoff Storage**
   - Update `/api/ai/chat` to detect checklist status
   - Store `CollectorPhaseHandoff` in `phaseData.phaseHandoffs.collector`
   - Use for analytics and debugging

3. **Extension Project Linking**
   - Add Vibn project ID to extension settings
   - Update extension to send `x-vibn-project-id` header
   - Update proxy to use explicit project ID instead of workspace path extraction

4. **Mode Transition Logic**
   - Update `resolveChatMode` to check for "is that everything?" confirmation
   - Add LLM structured output field: `readyForNextPhase: boolean`
   - Auto-transition when `readyForNextPhase === true`
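
The transition rule in step 4 could be sketched as follows. This is not the existing `resolveChatMode`: the function name `resolveChatModeSketch`, the `"collector_mode"` mode name, the `CollectorReply` shape, and the `hasMaterials` guard are all assumptions. Only `extraction_review_mode` and the `readyForNextPhase` field come from this document.

```typescript
// "collector_mode" is an assumed name; "extraction_review_mode" is from this document.
type ChatMode = "collector_mode" | "extraction_review_mode";

// Assumed structured output returned by the LLM for each Collector turn.
interface CollectorReply {
  reply: string;
  readyForNextPhase: boolean;
}

// Sketch of the transition rule only; the real resolveChatMode likely considers more state.
function resolveChatModeSketch(
  current: ChatMode,
  output: CollectorReply,
  hasMaterials: boolean, // guard (assumption): never advance with nothing to review
): ChatMode {
  if (current === "collector_mode" && output.readyForNextPhase && hasMaterials) {
    return "extraction_review_mode";
  }
  return current;
}
```

Driving the transition from a boolean in the structured output keeps the decision machine-readable, which fits the handoff contracts described in this document.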

---

## Architecture Alignment

This refactor aligns with the **"Why We Overhauled Vibn's Architecture"** document:

- ✅ **Clear, specialized phases** - Collector and Extractor now have distinct, focused jobs
- ✅ **Smart Handoff Protocol** - `CollectorPhaseHandoff` with checklist fields
- ✅ **Long-term semantic memory** - Only user-confirmed insights are chunked to AlloyDB
- ✅ **Structured outputs** - Checklist and handoff data is machine-readable
- ✅ **Better monitoring** - Handoff contracts can be logged for debugging

---

## Summary

The Collector and Extractor are now **proactive and collaborative**. Users are guided through setup, and only the content they confirm as important is chunked and stored for retrieval. This prevents bloat in AlloyDB and keeps unconfirmed content out of retrieval.

**Status:** ✅ Prompt, UI, and type changes complete and deployed (v2 prompts active); the items under Next Steps are still open