
Deduplication Use Cases (DUP) - ARCHIVED/FOUNDATIONS

Module Status: Archived as a standalone document task. Orchestrator Role: These capabilities have been repurposed for the Visual Kanban to allow users to "Hide/Stack" duplicate enquiries and track conversation history across multiple legal entities.


📖 What is Deduplication?

Deduplication (or "dedup") is the process of finding and removing duplicate documents from your system. Think of it like cleaning up your phone's photo gallery — finding all those duplicate photos taking up space and keeping just one copy.

Why Does It Matter?

| Problem | Impact |
| --- | --- |
| Storage costs | Duplicates waste expensive storage space |
| Confusion | Multiple copies lead to version control nightmares |
| Compliance risk | Outdated duplicates may contain incorrect information |
| Search clutter | Finding the right document becomes harder |

Real-World Example: A hospital with 100,000 patient documents might have 15% duplicates — that's 15,000 files wasting storage and causing confusion. Deduplication can reclaim that space and ensure staff always find the right document.

⚠️ What's NOT in MVP

> [!IMPORTANT]
> The MVP focuses only on deduplication. The following features are planned for later phases:

| Feature | Phase | Timeline |
| --- | --- | --- |
| AI Chat Assistant | Phase 3 | Month 6-8 |
| Document Summarization | Phase 3 | Month 6-8 |
| Semantic Search | Phase 1 | Week 7-12 |
| Auto-Classification | Phase 1 | Week 7-12 |
| Workflow Automation | Phase 2 | Month 3-6 |
| E-Signature Integration | Phase 2 | Month 3-6 |

📖 How Deduplication Works

```mermaid
flowchart LR
    A[📄 Document Upload] --> B{Same file exists?}
    B -->|Yes| C[🔴 Exact Duplicate]
    B -->|No| D{Similar content?}
    D -->|Yes| E[🟡 Near Duplicate]
    D -->|No| F[🟢 Unique Document]
    C --> G[📊 Dedup Report]
    E --> G
    F --> H[✅ Store Document]
```

Step-by-step:

  1. Upload — A new document arrives in the system
  2. Fingerprint Check — We create a digital fingerprint (hash) and check if an identical file exists
  3. Content Comparison — If not identical, we compare the actual content for similar documents
  4. Report — All duplicates are flagged for review or automatic cleanup
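
A minimal Python sketch of this pipeline, assuming an in-memory hash index; `find_similar_content` is a hypothetical placeholder for the embedding-based comparison detailed in DUP-003/004:

```python
import hashlib

seen_hashes: dict[str, str] = {}   # sha256 digest -> document id (stand-in for a real index)

def find_similar_content(data: bytes) -> list[str]:
    """Placeholder for the embedding-based content comparison (see DUP-003/004)."""
    return []

def check_document(doc_id: str, data: bytes) -> str:
    """Classify an incoming document as an exact duplicate, near duplicate, or unique."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return f"exact duplicate of {seen_hashes[digest]}"
    seen_hashes[digest] = doc_id
    return "near duplicate" if find_similar_content(data) else "unique"

print(check_document("doc_1", b"hello"))   # unique
print(check_document("doc_2", b"hello"))   # exact duplicate of doc_1
```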

📖 Three Types of Duplicates

| Type | What It Means | Example | How We Detect It |
| --- | --- | --- | --- |
| Exact Duplicate | Identical file, bit-for-bit | Same invoice uploaded twice | Digital fingerprint (hash) |
| Near Duplicate | Same content, minor differences | Invoice with updated date | Content comparison (AI) |
| Visual Duplicate | Same image, different size/quality | Logo resized or cropped | Image comparison |

📖 Key Terms Glossary

| Technical Term | Plain English |
| --- | --- |
| Hash | A digital fingerprint — a unique code generated from a file's contents |
| Embedding | A content signature that captures the meaning of a document |
| Similarity Score | How alike two documents are, expressed as a percentage (0-100%) |
| Perceptual Hash (pHash) | A fingerprint for images that works even if the image is resized |
| Vector Database | A specialized database for finding similar items quickly |

🔧 Technical Use Cases

The following sections contain detailed technical specifications for developers.


Use Case Quick Reference

| ID | Title | Priority |
| --- | --- | --- |
| DUP-001 | Compute File Hash (MD5/SHA256) | P1 |
| DUP-002 | Hide Duplicate Cards | P1 |
| DUP-003 | Compute Document Embedding | P1 |
| DUP-004 | Find Near-Duplicates | P1 |
| DUP-005 | Compute Visual Similarity (Images) | P2 |
| DUP-006 | Merge Duplicate Records | P2 |
| DUP-007 | Archive/Delete Duplicates | P2 |
| DUP-008 | Generate Dedup Report | P2 |

UC-DUP-001: Compute File Hash

Overview

| Field | Value |
| --- | --- |
| ID | DUP-001 |
| Title | Compute File Hash (MD5/SHA256) |
| Actor | System |
| Priority | P1 (MVP Phase 1) |

Description

Calculate cryptographic hashes of uploaded files for exact duplicate detection.

Steps

  1. Read file in chunks (64KB)
  2. Update MD5 and SHA256 hash objects
  3. Finalize and store both hashes
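
A minimal sketch of these steps using Python's standard hashlib; the function name and return shape are illustrative:

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64KB chunks, as in step 1

def compute_hashes(path: str) -> dict[str, str]:
    """Stream the file in chunks and return both MD5 and SHA256 digests."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            md5.update(chunk)
            sha256.update(chunk)
    return {"hash_md5": md5.hexdigest(), "hash_sha256": sha256.hexdigest()}
```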

Output

```json
{
  "hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
  "hash_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
```

Acceptance Criteria

  • Hash computed for all uploaded files
  • Processing time <1s for 100MB files
  • Both MD5 and SHA256 stored

UC-DUP-002: Hide Duplicate Cards

Overview

| Field | Value |
| --- | --- |
| ID | DUP-002 |
| Title | Hide Duplicate Cards |
| Actor | System / Board UI |
| Priority | P1 (MVP Phase 1) |

Description

Instead of rejecting duplicates, the system detects them and hides or stacks them in the Visual Kanban so the user sees only unique information.

Steps

  1. Compute hash of incoming email/attachment.
  2. Query database for existing match.
  3. If Match Found:
    • Mark new item with is_hidden_duplicate = true.
    • Link to primary (original) item.
    • UI Action: Do not spawn a new card; optionally increment "Dup Count" on original card.
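
A sketch of this decision logic in Python, assuming a pre-built `cards_by_hash` lookup from the card store (a hypothetical structure); the return payload mirrors the Output below:

```python
def handle_incoming(item_hash: str, cards_by_hash: dict[str, str]) -> dict:
    """Hide the new item if an existing card already carries the same hash."""
    primary_card_id = cards_by_hash.get(item_hash)
    if primary_card_id is not None:
        return {
            "is_duplicate": True,
            "action": "hide_from_board",          # UI: do not spawn a new card
            "primary_card_id": primary_card_id,   # link back to the original card
        }
    return {"is_duplicate": False, "action": "create_card"}
```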

Output

```json
{
  "is_duplicate": true,
  "action": "hide_from_board",
  "primary_card_id": "card_123"
}
```

Acceptance Criteria

  • Duplicate emails do not clutter the Kanban board.
  • Users can toggle "Show Duplicates" filter if needed.
  • Original card indicates how many duplicates it holds (e.g., "Contains 2 duplicates").

UC-DUP-003: Compute Document Embedding

Overview

| Field | Value |
| --- | --- |
| ID | DUP-003 |
| Title | Compute Document Embedding |
| Actor | Embedding Worker |
| Priority | P1 (MVP Phase 2) |

Description

Generate vector embedding from document text for semantic similarity detection.

Steps

  1. Retrieve extracted text from document
  2. Truncate/chunk if >512 tokens
  3. Pass through embedding model
  4. Store embedding in vector database

Model Configuration

| Setting | Value |
| --- | --- |
| Model | all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Max tokens | 512 |
| Pooling | Mean |
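
A minimal sketch using the sentence-transformers library with the configuration above; the chunking policy and function name are illustrative:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim output, mean pooling by default

def embed_document(text: str) -> list[float]:
    """Return a 384-dim embedding for the extracted document text."""
    # Input beyond the model's max sequence length is truncated;
    # long documents may instead be split into chunks and embedded separately.
    vector = model.encode(text, normalize_embeddings=True)
    return vector.tolist()
```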

Output

```json
{
  "document_id": "doc_abc",
  "embedding_id": "emb_123",
  "dimensions": 384,
  "model": "all-MiniLM-L6-v2"
}
```

Acceptance Criteria

  • Embeddings generated for all text-extracted documents
  • Stored in Qdrant for similarity search
  • Processing time <2s per document

UC-DUP-004: Find Near-Duplicates

Overview

| Field | Value |
| --- | --- |
| ID | DUP-004 |
| Title | Find Near-Duplicates |
| Actor | System |
| Priority | P1 (MVP Phase 2) |

Description

Identify documents with similar content using embedding similarity.

Steps

  1. Retrieve document embedding
  2. Query vector store for similar documents
  3. Apply similarity threshold (configurable)
  4. Return ranked list of candidates

Similarity Thresholds

| Threshold | Interpretation |
| --- | --- |
| >0.99 | Exact content duplicate |
| 0.95-0.99 | Near-duplicate (minor edits) |
| 0.85-0.95 | Similar document |
| <0.85 | Different document |
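
A sketch of the lookup against Qdrant (the vector store named in DUP-003) using the qdrant-client package; the collection name, URL, and default threshold are illustrative:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")   # assumed local Qdrant instance

def find_near_duplicates(embedding: list[float], threshold: float = 0.95) -> list[dict]:
    """Return documents whose similarity to the query vector exceeds the threshold."""
    hits = client.search(
        collection_name="documents",        # illustrative collection name
        query_vector=embedding,
        limit=10,
        score_threshold=threshold,          # Qdrant drops lower-scoring hits
    )
    return [{"id": hit.id, "similarity": hit.score} for hit in hits]
```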

Output

```json
{
  "document_id": "doc_abc",
  "near_duplicates": [
    {
      "id": "doc_xyz",
      "similarity": 0.97,
      "title": "Invoice #123 (v2)"
    },
    {
      "id": "doc_def",
      "similarity": 0.92,
      "title": "Invoice #123 (draft)"
    }
  ]
}
```

Acceptance Criteria

  • Near-duplicates detected with >90% precision
  • Query time <100ms
  • Threshold is configurable

UC-DUP-005: Compute Visual Similarity

Overview

| Field | Value |
| --- | --- |
| ID | DUP-005 |
| Title | Compute Visual Similarity (Images) |
| Actor | System |
| Priority | P2 (MVP Phase 3) |

Description

Detect duplicate images even with different resolutions or minor edits using perceptual hashing.

Hash Types

| Algorithm | Use Case |
| --- | --- |
| pHash | Perceptual hash (DCT-based), most robust to edits |
| dHash | Difference hash, fast gradient-based comparison |
| aHash | Average hash, simplest and quickest to compute |

Steps

  1. Resize image to standard size (8x8 or 16x16)
  2. Convert to grayscale
  3. Compute perceptual hash
  4. Store hash for comparison
  5. Query for similar hashes (Hamming distance)
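
A sketch of these steps using the Pillow and imagehash packages; the similarity conversion assumes the default 64-bit hash, so a Hamming distance of 2 maps to roughly 0.97 as in the Output below:

```python
from PIL import Image
import imagehash

def perceptual_hash(path: str) -> imagehash.ImageHash:
    """pHash internally resizes and grayscales the image before hashing."""
    return imagehash.phash(Image.open(path))

def image_similarity(path_a: str, path_b: str) -> float:
    """Convert the Hamming distance between two 64-bit hashes into a 0-1 similarity."""
    distance = perceptual_hash(path_a) - perceptual_hash(path_b)  # Hamming distance
    return 1 - distance / 64.0
```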

Output

```json
{
  "document_id": "doc_img123",
  "phash": "8f0f0f0f0f0f0f0f",
  "similar_images": [
    {
      "id": "doc_img456",
      "hamming_distance": 2,
      "similarity": 0.97
    }
  ]
}
```

Acceptance Criteria

  • Detects resized duplicates
  • Detects cropped images
  • Tolerates minor quality changes

UC-DUP-006: Merge Duplicate Records

Overview

| Field | Value |
| --- | --- |
| ID | DUP-006 |
| Title | Merge Duplicate Records |
| Actor | User |
| Priority | P2 |

Description

Allow users to manually merge duplicate documents, combining metadata and tags.

Steps

  1. User selects documents to merge
  2. Choose primary document
  3. Merge tags from all documents
  4. Update references to point to primary
  5. Archive or delete secondary documents
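
A sketch of the metadata merge, assuming each record is a dict carrying an `id` and a `tags` list (an illustrative shape, not the actual schema):

```python
def merge_records(primary: dict, duplicates: list[dict]) -> dict:
    """Combine tags from all duplicates onto the primary record, without duplicate tags."""
    merged_tags = list(primary.get("tags", []))
    for dup in duplicates:
        for tag in dup.get("tags", []):
            if tag not in merged_tags:   # preserve order, skip repeats
                merged_tags.append(tag)
    primary["tags"] = merged_tags
    primary["merged_from"] = [d["id"] for d in duplicates]   # merge history for the log
    return primary
```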

Acceptance Criteria

  • Tags are combined without duplicates
  • References are updated
  • Merge history is logged

UC-DUP-007: Archive or Delete Duplicates

Overview

| Field | Value |
| --- | --- |
| ID | DUP-007 |
| Title | Archive or Delete Duplicates |
| Actor | User, Admin |
| Priority | P2 |

Description

Remove or archive confirmed duplicate documents.

Options

| Action | Behavior |
| --- | --- |
| Archive | Move to archive, retain metadata |
| Delete | Soft delete, recoverable |
| Purge | Hard delete, permanent |
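
A sketch of how the three actions could differ, assuming a simple status field on each record; the field and function names are illustrative:

```python
def archive(record: dict) -> dict:
    """Keep metadata but mark the record as archived."""
    record["status"] = "archived"
    return record

def soft_delete(record: dict) -> dict:
    """Mark as deleted while keeping the record so the action can be reversed."""
    record["status"] = "deleted"
    return record

def purge(records: dict[str, dict], record_id: str) -> None:
    """Remove the record permanently; storage is reclaimed afterwards."""
    records.pop(record_id, None)
```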

Acceptance Criteria

  • Duplicates can be archived
  • Deletion is reversible (soft delete)
  • Storage is reclaimed after purge

UC-DUP-008: Generate Dedup Report

Overview

| Field | Value |
| --- | --- |
| ID | DUP-008 |
| Title | Generate Dedup Report |
| Actor | User, Admin |
| Priority | P2 |

Description

Generate a report of detected duplicates and storage savings.

Report Contents

| Section | Details |
| --- | --- |
| Summary | Total docs, duplicates found, storage saved |
| Exact Duplicates | List with file sizes |
| Near-Duplicates | List with similarity scores |
| Recommendations | Suggested actions |
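
A sketch of how the summary figures in the Output below could be derived from exact-duplicate groups; the per-document fields (`hash_sha256`, `size_bytes`) are an assumed shape:

```python
from collections import defaultdict

def dedup_summary(documents: list[dict]) -> dict:
    """Group documents by hash; every copy beyond the first counts as reclaimable storage."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for doc in documents:
        groups[doc["hash_sha256"]].append(doc)

    exact_duplicates = 0
    savings_bytes = 0
    for docs in groups.values():
        if len(docs) > 1:
            exact_duplicates += len(docs) - 1
            savings_bytes += sum(d["size_bytes"] for d in docs[1:])

    return {
        "total_documents": len(documents),
        "exact_duplicates": exact_duplicates,
        "potential_savings_gb": round(savings_bytes / 1024**3, 1),
    }
```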

Output

```json
{
  "generated_at": "2024-01-15T10:00:00Z",
  "summary": {
    "total_documents": 10000,
    "exact_duplicates": 1500,
    "near_duplicates": 800,
    "potential_savings_gb": 45.2
  },
  "exact_duplicate_groups": [...],
  "near_duplicate_groups": [...]
}
```

Acceptance Criteria

  • Report includes all duplicate types
  • Storage savings calculated
  • Export to CSV/PDF available

← Back to Use Cases | Previous: Ingestion | Next: Classification →