vibn-frontend/COLLECTOR_TO_EXTRACTION_FLOW.md

# Collector → Extraction Flow: Dependency Order

## Overview

This document explains the **exact order of operations** when a user completes the Collector phase and transitions to Extraction Review.

---

## Phase Flow Diagram

```
User says "that's everything"
         ↓
[1] AI detects readiness
         ↓
[2] Handoff persisted to Firestore
         ↓
[3] Backend extraction triggered (async)
         ↓
[4] Phase transitions to extraction_review
         ↓
[5] Mode resolver detects new phase
         ↓
[6] AI responds in extraction_review_mode
```

---

## Detailed Step-by-Step

### **Step 1: User Confirmation**

**Trigger:** User sends message like:
- "that's everything"
- "yes, analyze now"
- "I'm ready"

**What happens:**
- Message goes to `/api/ai/chat` POST handler
- LLM is called with full conversation history
- LLM returns structured response with `collectorHandoff` object

**Location:** `/app/api/ai/chat/route.ts`, lines 154-180

---

### **Step 2: Handoff Detection**

**Dependencies:**
- AI's `reply.collectorHandoff?.readyForExtraction` OR
- Fallback: AI's reply text contains trigger phrases

**What happens:**

```typescript
// Primary: Check structured output
let readyForExtraction = reply.collectorHandoff?.readyForExtraction ?? false;

// Fallback: Check reply text for phrases like "Perfect! Let me analyze"
if (!readyForExtraction && reply.reply) {
  const confirmPhrases = [
    'perfect! let me analyze',
    'perfect! i\'m starting',
    // ... etc
  ];
  const replyLower = reply.reply.toLowerCase();
  readyForExtraction = confirmPhrases.some(phrase => replyLower.includes(phrase));
}
```

**Location:** `/app/api/ai/chat/route.ts`, lines 191-210

**Critical:** If this doesn't detect readiness, the flow STOPS here.

---

### **Step 3: Build and Persist Collector Handoff**

**Dependencies:**
- `readyForExtraction === true` (from Step 2)
- Project context data (documents, GitHub, extension status)

**What happens:**

```typescript
const handoff: CollectorPhaseHandoff = {
  phase: 'collector',
  readyForNextPhase: readyForExtraction,  // Must be true!
  confidence: readyForExtraction ? 0.9 : 0.5,
  confirmed: {
    hasDocuments: (context.knowledgeSummary.bySourceType['imported_document'] ?? 0) > 0,
    documentCount: context.knowledgeSummary.bySourceType['imported_document'] ?? 0,
    githubConnected: !!context.project.githubRepo,
    githubRepo: context.project.githubRepo,
    extensionLinked: context.project.extensionLinked ?? false,
  },
  // ... etc
};

// Persist to Firestore
await adminDb.collection('projects').doc(projectId).set(
  { 'phaseData.phaseHandoffs.collector': handoff },
  { merge: true }
);
```

**Location:** `/app/api/ai/chat/route.ts`, lines 212-242

**Data written:**
- `projects/{projectId}/phaseData.phaseHandoffs.collector`
  - `readyForNextPhase: true`
  - `confirmed: { hasDocuments, githubConnected, extensionLinked }`

---

### **Step 4: Mark Collector Complete**

**Dependencies:**
- `handoff.readyForNextPhase === true` (from Step 3)

**What happens:**

```typescript
if (handoff.readyForNextPhase) {
  console.log(`[AI Chat] Collector complete - triggering backend extraction`);

  // Mark collector as complete
  await adminDb.collection('projects').doc(projectId).update({
    'phaseData.collectorCompletedAt': new Date().toISOString(),
  });

  // ... (Step 5 happens next)
}
```

**Location:** `/app/api/ai/chat/route.ts`, lines 252-260

**Data written:**
- `projects/{projectId}/phaseData.collectorCompletedAt` = timestamp

---

### **Step 5: Trigger Backend Extraction (Async)**

**Dependencies:**
- Collector marked complete (from Step 4)

**What happens:**

```typescript
// Trigger backend extraction (async - don't await)
import('@/lib/server/backend-extractor').then(({ runBackendExtractionForProject }) => {
  runBackendExtractionForProject(projectId).catch((error) => {
    console.error(`[AI Chat] Backend extraction failed for project ${projectId}:`, error);
  });
});
```

**Location:** `/app/api/ai/chat/route.ts`, lines 263-267

**Critical:** This is **asynchronous** - the chat response returns BEFORE extraction completes!

---

### **Step 6: Backend Extraction Runs**

**Dependencies:**
- Called from Step 5

**What happens:**

1. **Load project data**
   ```typescript
   const projectDoc = await adminDb.collection('projects').doc(projectId).get();
   const projectData = projectDoc.data();
   ```

2. **Load knowledge_items (documents)**
   ```typescript
   const knowledgeSnapshot = await adminDb
     .collection('knowledge_items')
     .where('projectId', '==', projectId)
     .where('sourceType', '==', 'imported_document')
     .get();
   ```

3. **Check if empty:**
   - **If NO documents:** Create empty handoff, skip to Step 6d
   - **If HAS documents:** Process each document (call LLM, extract insights, write chunks)

4. **Build extraction handoff:**
   ```typescript
   const extractionHandoff: PhaseHandoff = {
     phase: 'extraction',
     readyForNextPhase: boolean,  // true if insights found, false if no docs
     confidence: number,
     confirmed: { problems, targetUsers, features, constraints, opportunities },
     missing: [...],
     questionsForUser: [...],
     // ...
   };
   ```

5. **Persist extraction handoff and transition phase:**
   ```typescript
   await adminDb.collection('projects').doc(projectId).update({
     'phaseData.phaseHandoffs.extraction': extractionHandoff,
     currentPhase: 'extraction_review',  // ← PHASE TRANSITION!
     phaseStatus: 'in_progress',
     'phaseData.extractionCompletedAt': new Date().toISOString(),
   });
   ```

**Location:** `/lib/server/backend-extractor.ts`, entire file

**Data written:**
- `projects/{projectId}/currentPhase` = `"extraction_review"`
- `projects/{projectId}/phaseData.phaseHandoffs.extraction` = extraction results
- `chat_extractions/{id}` = per-document extraction data (if documents exist)
- `knowledge_chunks` (AlloyDB) = vectorized insights (if documents exist)

**Duration:** Could take 5-60 seconds depending on document count and size

---

### **Step 7: User Sends Next Message**

**Dependencies:**
- User sends a new message (e.g., "what did you find?")

**What happens:**

1. **Mode resolver is called:**
   ```typescript
   const resolvedMode = await resolveChatMode(projectId);
   ```

2. **Mode resolver logic (CRITICAL ORDER):**
   ```typescript
   // PRIORITY: Check explicit phase transitions FIRST
   if (projectData.currentPhase === 'extraction_review' ||
       projectData.currentPhase === 'analyzed') {
     return 'extraction_review_mode';  // ← Returns this!
   }

   // These checks are skipped because phase already transitioned:
   if (!hasKnowledge) {
     return 'collector_mode';
   }
   if (hasKnowledge && !hasExtractions) {
     return 'collector_mode';
   }
   ```

3. **Context builder loads extraction data:**
   ```typescript
   if (mode === 'extraction_review_mode') {
     context.phaseData.phaseHandoffs.extraction = ...;
     context.extractionSummary = ...;
     // Does NOT load raw documents
   }
   ```

4. **System prompt selected:**
   ```typescript
   const systemPrompt = EXTRACTION_REVIEW_V2.prompt;
   // Instructs AI to:
   // - NOT say "processing"
   // - Present extraction results
   // - Ask clarifying questions
   ```

5. **AI responds in extraction_review_mode**

**Location:**
- `/lib/server/chat-mode-resolver.ts` (mode resolution)
- `/lib/server/chat-context.ts` (context building)
- `/lib/ai/prompts/extraction-review.ts` (system prompt)

---

## Critical Dependencies

### **For handoff to trigger:**
1. ✅ AI must return `readyForExtraction: true` OR say trigger phrase
2. ✅ Firestore must persist `phaseData.phaseHandoffs.collector`

### **For backend extraction to run:**
1. ✅ `handoff.readyForNextPhase === true`
2. ✅ `runBackendExtractionForProject()` must be called

### **For phase transition:**
1. ✅ Backend extraction must complete successfully
2. ✅ Firestore must write `currentPhase: 'extraction_review'`

### **For mode to switch to extraction_review:**
1. ✅ `currentPhase === 'extraction_review'` in Firestore
2. ✅ Mode resolver must check `currentPhase` BEFORE checking `hasKnowledge`

### **For AI to stop hallucinating:**
1. ✅ Mode must be `extraction_review_mode` (not `collector_mode`)
2. ✅ System prompt must be `EXTRACTION_REVIEW_V2`
3. ✅ Context must include `phaseData.phaseHandoffs.extraction`

---

## What Can Go Wrong?

### **Issue 1: Handoff doesn't trigger**
- **Symptom:** AI keeps asking for more materials
- **Cause:** `readyForExtraction` is false
- **Fix:** Check fallback phrase detection is working

### **Issue 2: Backend extraction exits early**
- **Symptom:** Phase stays as `collector`, no extraction handoff
- **Cause:** No documents uploaded, empty handoff not created
- **Fix:** Ensure empty handoff logic runs (lines 58-93 in `backend-extractor.ts`)

### **Issue 3: Mode stays as `collector_mode`**
- **Symptom:** `projectPhase: "extraction_review"` but `mode: "collector_mode"`
- **Cause:** Mode resolver checking `!hasKnowledge` before `currentPhase`
- **Fix:** Reorder mode resolver logic (priority to `currentPhase`)

### **Issue 4: AI still says "processing"**
- **Symptom:** AI says "I'm analyzing..." in extraction_review
- **Cause:** Wrong system prompt being used
- **Fix:** Verify mode is `extraction_review_mode`, not `collector_mode`

---

## Testing Checklist

To verify the full flow works:

1. ✅ Create new project
2. ✅ AI welcomes user with collector checklist
3. ✅ User connects GitHub OR uploads docs
4. ✅ User says "that's everything"
5. ✅ Check Firestore: `phaseHandoffs.collector.readyForNextPhase === true`
6. ✅ Wait 5 seconds for async extraction
7. ✅ Check Firestore: `currentPhase === "extraction_review"`
8. ✅ Check Firestore: `phaseHandoffs.extraction` exists
9. ✅ User sends message: "what did you find?"
10. ✅ API returns `mode: "extraction_review_mode"`
11. ✅ AI presents extraction results (or asks for missing info)
12. ✅ AI does NOT say "processing" or "analyzing"

---

## Date

November 17, 2025