Commit: VIBN Frontend for Coolify deployment
New file: UPLOAD_CHUNKING_REMOVED.md (213 lines)
# Document Upload - Chunking Removed ✅

## Issue Found

Despite the Collector/Extractor refactor, document uploads were still auto-chunking files into semantic pieces.

## What Was Happening (Before)

```typescript
// upload-document/route.ts
const chunks = chunkDocument(content, {
  maxChunkSize: 2000,
  chunkOverlap: 200,
});

for (const [i, chunk] of chunks.entries()) {
  await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}
```

**Result:**
- 1 file upload → 5-10 separate knowledge_items
- Each chunk stored as a separate record
- Auto-chunking contradicted the Extractor AI's collaborative approach

## What Happens Now (After)

```typescript
// upload-document/route.ts
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['document', 'uploaded', 'pending_extraction'],
  },
});
```

**Result:**
- 1 file upload → 1 knowledge_item
- Whole document stored intact
- Tagged as `pending_extraction`
- Extractor AI will review and collaboratively chunk

---

## Files Changed

### 1. `app/api/projects/[projectId]/knowledge/upload-document/route.ts`

**Removed:**
- `chunkDocument()` import and calls
- Loop creating multiple knowledge_items
- Chunk metadata tracking

**Added:**
- Single knowledge_item creation with full content
- `pending_extraction` tag
- Status tracking in contextSources

**Before:**
```typescript
const chunks = chunkDocument(content, {...});
for (const [i, chunk] of chunks.entries()) {
  const knowledgeItem = await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}
```

**After:**
```typescript
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['pending_extraction'],
  },
});
```

### 2. `app/[workspace]/project/[projectId]/context/page.tsx`

**Changed UI text:**
- **Before:** "Documents will be automatically chunked and processed for AI context."
- **After:** "Documents will be stored for the Extractor AI to review and process."

---

## User Experience Changes

### Upload Flow (Now):

1. User uploads `project-spec.md`
2. File saved to Firebase Storage
3. **Whole document** stored as 1 knowledge_item
4. Appears in the Context page as "project-spec.md"
5. Tagged `pending_extraction`

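The upload steps above boil down to building a single knowledge_item payload. A minimal sketch follows; `buildUploadPayload` and the `KnowledgeItemInput` interface are illustrative names, not the actual route code, though the field values mirror this document's examples:

```typescript
// Hypothetical sketch of the payload the simplified upload route creates.
interface KnowledgeItemInput {
  title: string;
  content: string;
  sourceMeta: {
    filename: string;
    tags: string[];
  };
}

function buildUploadPayload(fileName: string, content: string): KnowledgeItemInput {
  return {
    title: fileName, // file name becomes the item title
    content,         // whole document, no chunking
    sourceMeta: {
      filename: fileName,
      tags: ['document', 'uploaded', 'pending_extraction'],
    },
  };
}

const item = buildUploadPayload('project-spec.md', '# Project Spec\n...');
console.log(item.sourceMeta.tags); // includes 'pending_extraction'
```

One upload, one payload: there is no loop anywhere for chunks to come from.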
### Extraction Flow (Later):

1. User says "Is that everything?" → AI transitions
2. Extractor AI mode activates
3. AI reads whole documents
4. AI asks: "I see this section about user roles - is this important for V1?"
5. User confirms: "Yes, that's critical"
6. AI calls `/api/projects/{id}/knowledge/chunk-insight`
7. Creates targeted chunk as `extracted_insight`
8. Chunks stored in AlloyDB for retrieval

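Step 6 can be sketched as the request the Extractor AI would issue. Only the URL path comes from this document; the body fields (`sourceItemId`, `excerpt`) and the builder itself are assumptions about the API contract:

```typescript
// Hypothetical request builder for the chunk-insight endpoint (step 6).
// The URL path is from this doc; the body shape is an assumption.
interface ChunkInsightRequest {
  url: string;
  body: {
    sourceItemId: string;     // the pending_extraction knowledge_item
    excerpt: string;          // the user-confirmed section
    tag: 'extracted_insight'; // how the resulting chunk is labeled (step 7)
  };
}

function buildChunkInsightRequest(
  projectId: string,
  sourceItemId: string,
  excerpt: string,
): ChunkInsightRequest {
  return {
    url: `/api/projects/${projectId}/knowledge/chunk-insight`,
    body: { sourceItemId, excerpt, tag: 'extracted_insight' },
  };
}

const req = buildChunkInsightRequest('proj456', 'abc123', 'User roles: ...');
console.log(req.url); // "/api/projects/proj456/knowledge/chunk-insight"
```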
---

## Why This Matters

### Before (Auto-chunking):

- ❌ System guessed what was important
- ❌ Over-chunked irrelevant sections
- ❌ Polluted the vector database with noise
- ❌ User had no control

### After (Collaborative):

- ✅ Extractor AI asks before chunking
- ✅ Only important sections are chunked
- ✅ User confirms what matters for V1
- ✅ Clean, relevant vector database

---

## API Response Changes

### Before:

```json
{
  "success": true,
  "chunkCount": 8,
  "knowledgeItemIds": ["id1", "id2", "id3", ...]
}
```

### After:

```json
{
  "success": true,
  "knowledgeItemId": "single_id",
  "status": "stored",
  "message": "Document stored. Extractor AI will review and chunk important sections."
}
```

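Client code can tell the two response shapes apart with a simple type guard. The interfaces below just transcribe the Before/After JSON examples; the guard function name is illustrative:

```typescript
// Response shapes transcribed from the Before/After examples above.
interface ChunkedUploadResponse {
  success: boolean;
  chunkCount: number;
  knowledgeItemIds: string[];
}

interface StoredUploadResponse {
  success: boolean;
  knowledgeItemId: string;
  status: 'stored';
  message: string;
}

// Narrow an unknown response to the new single-item shape.
function isStoredResponse(r: any): r is StoredUploadResponse {
  return typeof r?.knowledgeItemId === 'string' && r?.status === 'stored';
}

const res = {
  success: true,
  knowledgeItemId: 'single_id',
  status: 'stored',
  message: 'Document stored. Extractor AI will review and chunk important sections.',
};
console.log(isStoredResponse(res)); // true
```

Old chunked responses fail the guard because they carry `knowledgeItemIds` (plural) and no `status` field.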
---

## Database Structure

### Firestore - knowledge_items:

```json
{
  "id": "abc123",
  "projectId": "proj456",
  "sourceType": "imported_document",
  "title": "project-spec.md",
  "content": "< FULL DOCUMENT CONTENT >",
  "sourceMeta": {
    "filename": "project-spec.md",
    "tags": ["document", "uploaded", "pending_extraction"],
    "url": "https://storage.googleapis.com/..."
  }
}
```

### Firestore - contextSources:

```json
{
  "type": "document",
  "name": "project-spec.md",
  "summary": "Document (5423 characters) - pending extraction",
  "metadata": {
    "knowledgeItemId": "abc123",
    "status": "pending_extraction"
  }
}
```

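The `summary` string above looks derivable from the content length. A one-line sketch of that formatting, assuming the helper name (`buildPendingSummary`) is hypothetical:

```typescript
// Hypothetical helper producing the contextSources summary string shown above.
function buildPendingSummary(content: string): string {
  return `Document (${content.length} characters) - pending extraction`;
}

console.log(buildPendingSummary('hello'));
// "Document (5 characters) - pending extraction"
```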
---

## Testing Checklist

- [x] Remove chunking logic from upload endpoint
- [x] Update UI text to reflect new behavior
- [x] Verify whole document is stored
- [x] Confirm `pending_extraction` tag is set
- [ ] Test document upload with 3 files
- [ ] Verify Collector checklist updates
- [ ] Test Extractor AI reads full documents
- [ ] Test `/chunk-insight` API creates extracted chunks

---

## Related Documentation

- `TABLE_STAKES_IMPLEMENTATION.md` - Full feature implementation
- `COLLECTOR_EXTRACTOR_REFACTOR.md` - Refactor rationale
- `QA_FIXES_APPLIED.md` - QA testing results

---

## Status

✅ **Auto-chunking removed**
✅ **UI text updated**
✅ **Server restarted**
🔄 **Ready for testing**

The upload flow now correctly stores whole documents and defers chunking to the collaborative Extractor AI phase.