vibn-frontend/UPLOAD_CHUNKING_REMOVED.md

Document Upload - Chunking Removed

Issue Found

Despite the Collector/Extractor refactor, document uploads were still auto-chunking files into semantic pieces.

What Was Happening (Before)

// upload-document/route.ts
const chunks = chunkDocument(content, {
  maxChunkSize: 2000,
  chunkOverlap: 200,
});

for (const [i, chunk] of chunks.entries()) {
  await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}

Result:

  • 1 file upload → 5-10 separate knowledge_items
  • Each chunk stored as separate record
  • Auto-chunking contradicted Extractor AI's collaborative approach

What Happens Now (After)

// upload-document/route.ts
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['document', 'uploaded', 'pending_extraction'],
  },
});

Result:

  • 1 file upload → 1 knowledge_item
  • Whole document stored intact
  • Tagged as pending_extraction
  • Extractor AI will review and collaboratively chunk
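The new single-item path boils down to building one payload per upload. A minimal sketch in TypeScript; `buildKnowledgeItem` and the `KnowledgeItemInput` type are illustrative stand-ins, with only the field names and tags taken from the snippets above:

```typescript
// Shape of the payload handed to createKnowledgeItem (field names
// follow this document's examples; the type itself is illustrative).
interface KnowledgeItemInput {
  title: string;
  content: string;
  sourceMeta: { tags: string[] };
}

// One upload -> exactly one knowledge_item. Chunking is deferred
// to the collaborative Extractor AI phase.
function buildKnowledgeItem(fileName: string, content: string): KnowledgeItemInput {
  return {
    title: fileName, // plain filename, no "(chunk i/n)" suffix
    content,         // whole document, untouched
    sourceMeta: { tags: ['document', 'uploaded', 'pending_extraction'] },
  };
}

// The route would then do: await createKnowledgeItem(buildKnowledgeItem(...))
const item = buildKnowledgeItem('project-spec.md', '# Spec\n...');
console.log(item.title);                                          // "project-spec.md"
console.log(item.sourceMeta.tags.includes('pending_extraction')); // true
```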

Files Changed

1. app/api/projects/[projectId]/knowledge/upload-document/route.ts

Removed:

  • chunkDocument() import and calls
  • Loop creating multiple knowledge_items
  • Chunk metadata tracking

Added:

  • Single knowledge_item creation with full content
  • pending_extraction tag
  • Status tracking in contextSources

Before:

const chunks = chunkDocument(content, {...});
for (const [i, chunk] of chunks.entries()) {
  await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}

After:

const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['pending_extraction'],
  },
});

2. app/[workspace]/project/[projectId]/context/page.tsx

Changed UI text:

  • Before: "Documents will be automatically chunked and processed for AI context."
  • After: "Documents will be stored for the Extractor AI to review and process."

User Experience Changes

Upload Flow (Now):

  1. User uploads project-spec.md
  2. File saved to Firebase Storage
  3. Whole document stored as 1 knowledge_item
  4. Appears in Context page as "project-spec.md"
  5. Tagged pending_extraction

Extraction Flow (Later):

  1. User says "Is that everything?" → AI transitions
  2. Extractor AI mode activates
  3. AI reads whole documents
  4. AI asks: "I see this section about user roles - is this important for V1?"
  5. User confirms: "Yes, that's critical"
  6. AI calls /api/projects/{id}/knowledge/chunk-insight
  7. Creates targeted chunk as extracted_insight
  8. Chunks stored in AlloyDB for retrieval
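Steps 6-7 above can be sketched as a request-body builder for the chunk-insight call. The document only confirms the endpoint path and the `extracted_insight` label, so the field names below (`sourceKnowledgeItemId`, `insightText`, `tags`) are assumptions:

```typescript
// Hypothetical request body for POST /api/projects/{id}/knowledge/chunk-insight.
// Field names are illustrative; only the endpoint path and the
// 'extracted_insight' label come from this document.
interface ChunkInsightRequest {
  sourceKnowledgeItemId: string; // the whole-document item being mined
  insightText: string;           // the user-confirmed section
  tags: string[];                // marks the chunk as an extracted insight
}

function buildChunkInsightRequest(
  sourceKnowledgeItemId: string,
  insightText: string,
): ChunkInsightRequest {
  return {
    sourceKnowledgeItemId,
    insightText,
    tags: ['extracted_insight'],
  };
}

// The Extractor AI would then POST this body, e.g.:
// fetch(`/api/projects/${projectId}/knowledge/chunk-insight`, {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildChunkInsightRequest('abc123', 'User roles: ...')),
// });
```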

Why This Matters

Before (Auto-chunking):

  • System guessed what's important
  • Over-chunked irrelevant sections
  • Polluted vector database with noise
  • User had no control

After (Collaborative):

  • Extractor AI asks before chunking
  • Only important sections chunked
  • User confirms what matters for V1
  • Clean, relevant vector database

API Response Changes

Before:

{
  "success": true,
  "chunkCount": 8,
  "knowledgeItemIds": ["id1", "id2", "id3", ...]
}

After:

{
  "success": true,
  "knowledgeItemId": "single_id",
  "status": "stored",
  "message": "Document stored. Extractor AI will review and chunk important sections."
}
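Clients migrating from the old response shape can narrow the payload with a type guard. A sketch assuming the two JSON shapes shown above; the types and the guard itself are illustrative, not part of the codebase:

```typescript
// Illustrative types matching the before/after JSON shapes above.
interface LegacyUploadResponse {
  success: boolean;
  chunkCount: number;
  knowledgeItemIds: string[];
}

interface UploadResponse {
  success: boolean;
  knowledgeItemId: string;
  status: 'stored';
  message: string;
}

// Hypothetical guard a caller could use while both response
// formats are in the wild.
function isUploadResponse(x: unknown): x is UploadResponse {
  const r = x as UploadResponse;
  return (
    typeof r === 'object' && r !== null &&
    typeof r.knowledgeItemId === 'string' &&
    r.status === 'stored'
  );
}
```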

Database Structure

Firestore - knowledge_items:

{
  "id": "abc123",
  "projectId": "proj456",
  "sourceType": "imported_document",
  "title": "project-spec.md",
  "content": "< FULL DOCUMENT CONTENT >",
  "sourceMeta": {
    "filename": "project-spec.md",
    "tags": ["document", "uploaded", "pending_extraction"],
    "url": "https://storage.googleapis.com/..."
  }
}

Firestore - contextSources:

{
  "type": "document",
  "name": "project-spec.md",
  "summary": "Document (5423 characters) - pending extraction",
  "metadata": {
    "knowledgeItemId": "abc123",
    "status": "pending_extraction"
  }
}
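The `contextSources` entry is derived from the knowledge item it points at; a small helper could produce it from one upload. Field names and the summary format come from the example records above, everything else is illustrative:

```typescript
// Build the contextSources entry for a stored document, matching the
// example record above. Only the field names and the summary format
// come from this document; the helper itself is a sketch.
interface ContextSource {
  type: 'document';
  name: string;
  summary: string;
  metadata: { knowledgeItemId: string; status: 'pending_extraction' };
}

function buildContextSource(
  name: string,
  content: string,
  knowledgeItemId: string,
): ContextSource {
  return {
    type: 'document',
    name,
    summary: `Document (${content.length} characters) - pending extraction`,
    metadata: { knowledgeItemId, status: 'pending_extraction' },
  };
}

const src = buildContextSource('project-spec.md', 'x'.repeat(5423), 'abc123');
console.log(src.summary); // "Document (5423 characters) - pending extraction"
```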

Testing Checklist

  • Remove chunking logic from upload endpoint
  • Update UI text to reflect new behavior
  • Verify whole document is stored
  • Confirm pending_extraction tag is set
  • Test document upload with 3 files
  • Verify Collector checklist updates
  • Test Extractor AI reads full documents
  • Test /chunk-insight API creates extracted chunks

Related Documentation

  • TABLE_STAKES_IMPLEMENTATION.md - Full feature implementation
  • COLLECTOR_EXTRACTOR_REFACTOR.md - Refactor rationale
  • QA_FIXES_APPLIED.md - QA testing results

Status

  • Auto-chunking removed
  • UI text updated
  • Server restarted
  • 🔄 Ready for testing

The upload flow now correctly stores whole documents and defers chunking to the collaborative Extractor AI phase.