# Document Upload - Chunking Removed ✅

## Issue Found

Despite the Collector/Extractor refactor, document uploads were still auto-chunking files into semantic pieces.

## What Was Happening (Before)

```typescript
// upload-document/route.ts
const chunks = chunkDocument(content, {
  maxChunkSize: 2000,
  chunkOverlap: 200,
});

for (const [i, chunk] of chunks.entries()) {
  await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}
```

**Result:**

- 1 file upload → 5-10 separate knowledge_items
- Each chunk stored as a separate record
- Auto-chunking contradicted the Extractor AI's collaborative approach
## What Happens Now (After)

```typescript
// upload-document/route.ts
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['document', 'uploaded', 'pending_extraction'],
  },
});
```

**Result:**

- 1 file upload → 1 knowledge_item
- Whole document stored intact
- Tagged as `pending_extraction`
- Extractor AI will review and collaboratively chunk
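The `pending_extraction` tag is what the later Extractor phase keys on. A minimal sketch of selecting unprocessed documents — the `KnowledgeItem` interface here is a hypothetical mirror of the record created above, not the project's real type:

```typescript
// Hypothetical shape mirroring the record created above; the project's
// real KnowledgeItem type may differ.
interface KnowledgeItem {
  id: string;
  title: string;
  content: string;
  sourceMeta: { tags: string[] };
}

// Return only documents the Extractor AI has not yet reviewed.
function pendingExtraction(items: KnowledgeItem[]): KnowledgeItem[] {
  return items.filter((item) =>
    item.sourceMeta.tags.includes('pending_extraction'),
  );
}
```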
---

## Files Changed
### 1. `app/api/projects/[projectId]/knowledge/upload-document/route.ts`

**Removed:**

- `chunkDocument()` import and calls
- Loop creating multiple knowledge_items
- Chunk metadata tracking

**Added:**

- Single knowledge_item creation with full content
- `pending_extraction` tag
- Status tracking in contextSources

**Before:**

```typescript
const chunks = chunkDocument(content, {...});
for (const [i, chunk] of chunks.entries()) {
  const knowledgeItem = await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}
```
**After:**

```typescript
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['pending_extraction'],
  },
});
```
### 2. `app/[workspace]/project/[projectId]/context/page.tsx`

**Changed UI text:**

- **Before:** "Documents will be automatically chunked and processed for AI context."
- **After:** "Documents will be stored for the Extractor AI to review and process."

---
## User Experience Changes

### Upload Flow (Now):

1. User uploads `project-spec.md`
2. File saved to Firebase Storage
3. **Whole document** stored as 1 knowledge_item
4. Appears in Context page as "project-spec.md"
5. Tagged `pending_extraction`
### Extraction Flow (Later):

1. User says "Is that everything?" → AI transitions
2. Extractor AI mode activates
3. AI reads whole documents
4. AI asks: "I see this section about user roles - is this important for V1?"
5. User confirms: "Yes, that's critical"
6. AI calls `/api/projects/{id}/knowledge/chunk-insight`
7. Creates a targeted chunk tagged `extracted_insight`
8. Chunks stored in AlloyDB for retrieval
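Steps 6-7 above can be sketched as a payload builder. The field names (`knowledgeItemId`, `content`, `tags`) are illustrative assumptions — the actual body accepted by the chunk-insight route is defined by that route, not here:

```typescript
// Illustrative request body for the chunk-insight call; field names are
// assumptions, not the route's documented contract.
interface ChunkInsightPayload {
  knowledgeItemId: string; // the whole-document item the section came from
  content: string;         // the section the user confirmed as important
  tags: string[];
}

function buildChunkInsightPayload(
  sourceItemId: string,
  confirmedSection: string,
): ChunkInsightPayload {
  return {
    knowledgeItemId: sourceItemId,
    content: confirmedSection,
    tags: ['extracted_insight'], // matches step 7 above
  };
}
```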
---

## Why This Matters

### Before (Auto-chunking):

- ❌ System guessed what was important
- ❌ Over-chunked irrelevant sections
- ❌ Polluted the vector database with noise
- ❌ User had no control

### After (Collaborative):

- ✅ Extractor AI asks before chunking
- ✅ Only important sections are chunked
- ✅ User confirms what matters for V1
- ✅ Clean, relevant vector database
---

## API Response Changes

### Before:

```json
{
  "success": true,
  "chunkCount": 8,
  "knowledgeItemIds": ["id1", "id2", "id3", ...]
}
```

### After:

```json
{
  "success": true,
  "knowledgeItemId": "single_id",
  "status": "stored",
  "message": "Document stored. Extractor AI will review and chunk important sections."
}
```
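Callers migrating to the new shape can narrow the parsed response before use. A sketch of a type guard for the single-item response shown above:

```typescript
// Shape of the new upload response, taken from the example above.
interface UploadResponse {
  success: boolean;
  knowledgeItemId: string;
  status: string;
  message: string;
}

// Narrow an unknown parsed JSON value to the new response shape.
// The old multi-chunk shape (chunkCount/knowledgeItemIds) fails this check.
function isUploadResponse(value: unknown): value is UploadResponse {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.success === 'boolean' &&
    typeof v.knowledgeItemId === 'string' &&
    typeof v.status === 'string' &&
    typeof v.message === 'string'
  );
}
```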
---

## Database Structure

### Firestore - knowledge_items:

```json
{
  "id": "abc123",
  "projectId": "proj456",
  "sourceType": "imported_document",
  "title": "project-spec.md",
  "content": "< FULL DOCUMENT CONTENT >",
  "sourceMeta": {
    "filename": "project-spec.md",
    "tags": ["document", "uploaded", "pending_extraction"],
    "url": "https://storage.googleapis.com/..."
  }
}
```

### Firestore - contextSources:

```json
{
  "type": "document",
  "name": "project-spec.md",
  "summary": "Document (5423 characters) - pending extraction",
  "metadata": {
    "knowledgeItemId": "abc123",
    "status": "pending_extraction"
  }
}
```
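The `summary` string in contextSources can be derived directly from the stored content's length. A small sketch reproducing the wording of the example record (treat the exact phrasing as illustrative):

```typescript
// Build the contextSources summary for a document awaiting extraction,
// mirroring the example record above.
function pendingSummary(content: string): string {
  return `Document (${content.length} characters) - pending extraction`;
}
```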
---

## Testing Checklist

- [x] Remove chunking logic from upload endpoint
- [x] Update UI text to reflect new behavior
- [x] Verify whole document is stored
- [x] Confirm `pending_extraction` tag is set
- [ ] Test document upload with 3 files
- [ ] Verify Collector checklist updates
- [ ] Test Extractor AI reads full documents
- [ ] Test `/chunk-insight` API creates extracted chunks
---

## Related Documentation

- `TABLE_STAKES_IMPLEMENTATION.md` - Full feature implementation
- `COLLECTOR_EXTRACTOR_REFACTOR.md` - Refactor rationale
- `QA_FIXES_APPLIED.md` - QA testing results

---
## Status

✅ **Auto-chunking removed**

✅ **UI text updated**

✅ **Server restarted**

🔄 **Ready for testing**

The upload flow now correctly stores whole documents and defers chunking to the collaborative Extractor AI phase.