PRD: AI Email Classification Engine
Introduction
The AI Email Classification Engine is the "Traffic Cop" of the Pebble Business Orchestrator. It monitors inbound emails, analyzes their content (subject, body, and attachments), and routes them to the appropriate business stream (CRM, ERP, or Support). It also extracts user intent (Enquiry, Complaint, PO, etc.) and maintains contextual threading to prevent duplicate lead creation.
Goals
- Automated Routing: Accurately segregate emails into Sales (CRM) vs Operations (ERP) streams.
- Intent Extraction: Identify specific actionable intents such as "Price Request," "Sample Requested," or "Order Status."
- Contextual Threading: Link replies and forwards to existing entities (Leads, Orders, or Tickets) to maintain a "Zero Inbox" philosophy.
- Hybrid Approach: Utilize high-speed keyword rules for common patterns and LLM (Gemini/GPT) for complex sentiment and intent analysis.
- Fail-Safe Mechanism: Default ambiguous emails to "CRM Complaint" for human review to ensure no business risk.
Use Case Mapping
This PRD provides the implementation blueprint for the following functional specifications:
- EML-002: Classify Email to CRM Stream
- EML-003: Classify Email to ERP Stream
- KBN-006: Communication Re-tagging & Threads
User Stories
US-001: Rule-Based Stream Routing (CRM vs ERP)
Description: As a system, I want to classify an inbound email as 'CRM' or 'ERP' using high-speed keyword matching so it can be routed immediately.
Acceptance Criteria:
- System extracts "Clean Text" (strip HTML, remove "Re:/Fwd:" prefixes) from subject and body.
- If subject or body contains words from the `CRM_KEYWORDS` list (e.g., quote, pricing, inquiry), set `metadata.stream = 'CRM'` and `metadata.category = 'Enquiry'`.
- If subject or body contains words from the `ERP_KEYWORDS` list (e.g., dispatch, tracking, invoice, PO), set `metadata.stream = 'ERP'` and `metadata.category = 'Logistics'`.
- If both or neither match, set `metadata.stream = 'CRM'` and `metadata.category = 'Needs Triage'`.
- Classification result is stored in the `emails` table `metadata` JSONB column.
- Verification: Run `npm run test:classifier` and ensure 100% pass on rule-based mock emails.
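The rule-based path above could be sketched roughly as follows. This is a minimal illustration, not the engine's actual implementation: the keyword lists contain only the examples named in the criteria, and the function names `cleanText`, `containsAny`, and `classifyByRules` are hypothetical. Matching is done on whole words so that a short keyword like "PO" does not fire on "report".

```typescript
// Hypothetical keyword lists: only the examples given in the acceptance criteria.
const CRM_KEYWORDS: string[] = ["quote", "pricing", "inquiry"];
const ERP_KEYWORDS: string[] = ["dispatch", "tracking", "invoice", "po"];

interface RuleResult {
  stream: "CRM" | "ERP";
  category: "Enquiry" | "Logistics" | "Needs Triage";
}

// Strip HTML tags, then leading "Re:" / "Fw:" / "Fwd:" prefixes, then collapse whitespace.
function cleanText(raw: string): string {
  return raw
    .replace(/<[^>]*>/g, " ")
    .replace(/^\s*((re|fwd?)\s*:\s*)+/i, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Whole-word, case-insensitive match against a keyword list.
function containsAny(text: string, keywords: string[]): boolean {
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/);
  return keywords.some((k) => tokens.indexOf(k) !== -1);
}

function classifyByRules(subject: string, body: string): RuleResult {
  const text = cleanText(subject) + " " + cleanText(body);
  const crm = containsAny(text, CRM_KEYWORDS);
  const erp = containsAny(text, ERP_KEYWORDS);
  if (crm && !erp) return { stream: "CRM", category: "Enquiry" };
  if (erp && !crm) return { stream: "ERP", category: "Logistics" };
  // Both lists matched, or neither did: route to CRM for triage.
  return { stream: "CRM", category: "Needs Triage" };
}
```

For example, `classifyByRules("Fwd: Invoice for PO-1234", "Tracking details attached")` yields the ERP/Logistics result, while a message that mentions both a quote and an invoice falls through to 'Needs Triage'.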
US-002: Intent & Sentiment Extraction via LLM
Description: As a system, I want to use an LLM (Gemini/GPT) to refine "Needs Triage" emails or extract granular intent.
Acceptance Criteria:
- For all newly ingested emails, system calls LLM with prompt containing Subject + Body (max 2000 chars).
- LLM response must be a structured JSON with fields: `intent` (Enquiry/Complaint/Order_Status/Sample_Request/Support), `sentiment` (Positive/Neutral/Negative), `urgency` (High/Medium/Low).
- If `sentiment == Negative` AND `urgency == High`, override stream to `CRM` and category to `High Priority Complaint`.
- Verification: Verify in browser by checking the "Metadata" tab on a specific email card in the dev board.
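One way to handle the LLM reply is sketched below, leaving out the actual Gemini/GPT call. The names `buildPromptInput`, `parseLlmResult`, and `applyLlmOverride` are hypothetical, and the sketch assumes the 2000-character limit applies to the combined Subject + Body text. A `null` from the parser marks the point where the caller would apply its failure handling.

```typescript
type Intent = "Enquiry" | "Complaint" | "Order_Status" | "Sample_Request" | "Support";
type Sentiment = "Positive" | "Neutral" | "Negative";
type Urgency = "High" | "Medium" | "Low";

interface LlmResult { intent: Intent; sentiment: Sentiment; urgency: Urgency; }
interface Routing { stream: "CRM" | "ERP"; category: string; }

const INTENTS: string[] = ["Enquiry", "Complaint", "Order_Status", "Sample_Request", "Support"];
const SENTIMENTS: string[] = ["Positive", "Neutral", "Negative"];
const URGENCIES: string[] = ["High", "Medium", "Low"];
const MAX_PROMPT_CHARS = 2000;

// Assumption: the 2000-char cap covers subject and body together.
function buildPromptInput(subject: string, body: string): string {
  return ("Subject: " + subject + "\nBody: " + body).slice(0, MAX_PROMPT_CHARS);
}

// Returns null when the reply is not valid JSON or a field is missing or out of range.
function parseLlmResult(raw: string): LlmResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const rec = data as Record<string, unknown>;
  const intent = rec["intent"];
  const sentiment = rec["sentiment"];
  const urgency = rec["urgency"];
  if (typeof intent !== "string" || INTENTS.indexOf(intent) === -1) return null;
  if (typeof sentiment !== "string" || SENTIMENTS.indexOf(sentiment) === -1) return null;
  if (typeof urgency !== "string" || URGENCIES.indexOf(urgency) === -1) return null;
  return { intent: intent as Intent, sentiment: sentiment as Sentiment, urgency: urgency as Urgency };
}

// Negative sentiment combined with High urgency overrides whatever routing was set before.
function applyLlmOverride(current: Routing, llm: LlmResult): Routing {
  if (llm.sentiment === "Negative" && llm.urgency === "High") {
    return { stream: "CRM", category: "High Priority Complaint" };
  }
  return current;
}
```

Validating each field against a fixed list keeps a malformed or creative LLM reply from writing unexpected values into the metadata.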
US-003: Threading & Activity Linking (EML-005)
Description: As a user, I want new emails from known contacts to link to their existing records to prevent board clutter.
Acceptance Criteria:
- System extracts `from_email` from header.
- System queries `customer_contacts` for a match.
- If match found and contact has an "Active" Prelead/Ticket: Link email as a new `Activity` and set `metadata.linked_entity_id`. Bump the `last_activity_at` for the parent record.
- If no active record: Create new Prelead (as per EML-004).
- Verification: Send a reply to an existing enquiry and verify it appears in the "Activity Timeline" instead of a new card.
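The linking decision above can be illustrated with an in-memory sketch. The record shapes and the names `extractFromEmail`, `findActiveRecord`, and `linkOrCreate` are assumptions made for illustration; a real implementation would query `customer_contacts` and the parent tables rather than arrays, and Prelead creation (EML-004) is only signalled here, not performed.

```typescript
interface Contact { email: string; contactId: string; }
interface ParentRecord {
  id: string;
  contactId: string;
  kind: "Prelead" | "Ticket";
  status: "Active" | "Closed";
  lastActivityAt: Date;
}
interface ThreadingOutcome {
  action: "linked" | "create_prelead";
  linkedEntityId: string | null;
}

// Pull the address out of a From header such as "Jane Doe <jane@acme.com>".
function extractFromEmail(fromHeader: string): string | null {
  const angle = fromHeader.match(/<([^<>\s]+@[^<>\s]+)>/);
  const bare = fromHeader.match(/^\s*([^<>\s]+@[^<>\s]+)\s*$/);
  const found = angle ? angle[1] : bare ? bare[1] : undefined;
  return found ? found.toLowerCase() : null;
}

// First Active record belonging to a contact whose email matches.
function findActiveRecord(email: string, contacts: Contact[], records: ParentRecord[]): ParentRecord | null {
  for (const contact of contacts) {
    if (contact.email.toLowerCase() !== email) continue;
    for (const rec of records) {
      if (rec.contactId === contact.contactId && rec.status === "Active") return rec;
    }
  }
  return null;
}

function linkOrCreate(fromHeader: string, contacts: Contact[], records: ParentRecord[], now: Date): ThreadingOutcome {
  const email = extractFromEmail(fromHeader);
  const active = email === null ? null : findActiveRecord(email, contacts, records);
  if (active !== null) {
    active.lastActivityAt = now; // bump the parent record's last activity
    return { action: "linked", linkedEntityId: active.id };
  }
  // Unknown sender or no active record: the caller creates a new Prelead.
  return { action: "create_prelead", linkedEntityId: null };
}
```

In this sketch `linkedEntityId` is the value that would be written to `metadata.linked_entity_id`.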
US-004: JSONB Metadata Schema Definition
Description: As a developer, I want a standardized schema for classification metadata for UI consistency.
Acceptance Criteria:
- The `metadata` column in the `emails` table must follow this JSON structure:
    {
      "stream": "CRM | ERP",
      "category": "Enquiry | Complaint | Logistics | Support",
      "intent_tags": ["Sample_Req", "Price_Query"],
      "sentiment": "Positive | Neutral | Negative",
      "confidence_score": 0.0,
      "threading": {
        "parent_id": "UUID",
        "is_reply": true
      }
    }
- Database constraint exists to ensure `metadata` is valid JSON.
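On the application side, the same structure could be mirrored as a type plus a runtime check before anything is written to the column. The sketch below is one possible shape, with `EmailMetadata`, `isOneOf`, and `isEmailMetadata` as hypothetical names. The category list follows the schema as written; values used elsewhere in this PRD, such as 'Needs Triage', would need to be added if they are meant to be stored.

```typescript
interface EmailMetadata {
  stream: "CRM" | "ERP";
  category: "Enquiry" | "Complaint" | "Logistics" | "Support";
  intent_tags: string[];
  sentiment: "Positive" | "Neutral" | "Negative";
  confidence_score: number;
  threading: { parent_id: string; is_reply: boolean };
}

function isOneOf(value: unknown, allowed: string[]): boolean {
  return typeof value === "string" && allowed.indexOf(value) !== -1;
}

// Runtime guard: checks field presence, primitive types, and enum membership.
// It does not verify that parent_id is a well-formed UUID.
function isEmailMetadata(value: unknown): value is EmailMetadata {
  if (typeof value !== "object" || value === null) return false;
  const m = value as Record<string, unknown>;
  const t = m["threading"];
  if (typeof t !== "object" || t === null) return false;
  const th = t as Record<string, unknown>;
  const tags = m["intent_tags"];
  return (
    isOneOf(m["stream"], ["CRM", "ERP"]) &&
    isOneOf(m["category"], ["Enquiry", "Complaint", "Logistics", "Support"]) &&
    Array.isArray(tags) &&
    tags.every((x: unknown) => typeof x === "string") &&
    isOneOf(m["sentiment"], ["Positive", "Neutral", "Negative"]) &&
    typeof m["confidence_score"] === "number" &&
    typeof th["parent_id"] === "string" &&
    typeof th["is_reply"] === "boolean"
  );
}
```

A guard like this complements the database constraint, which only checks that the column holds valid JSON and not that it has the expected fields.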
Functional Requirements
- FR-1: The system MUST poll connected mailboxes every 30-60 seconds (EML-001).
- FR-2: The system MUST run keyword-based matching BEFORE calling the LLM to save on latency/cost.
- FR-3: The system MUST identify Order IDs (e.g., `PO-1234`) and link them to the ERP stream (EML-003).
- FR-4: The system MUST scrub "Re:" and "Fwd:" from subjects for better similarity matching.
- FR-5: In case of LLM failure/timeout, the system MUST fall back to the default triage category.
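As an illustration of FR-3, Order IDs could be pulled out with a regular expression. The pattern below assumes the format is "PO-" followed by digits, based solely on the `PO-1234` example; the real ID format may differ, and `extractOrderIds` is a hypothetical name.

```typescript
// Assumed format: "PO-" plus one or more digits, matched case-insensitively.
const ORDER_ID_PATTERN = /\bPO-\d+\b/gi;

// Returns the distinct Order IDs found in the text, upper-cased, in order of appearance.
function extractOrderIds(text: string): string[] {
  const matches = text.match(ORDER_ID_PATTERN) || [];
  const ids: string[] = [];
  for (const m of matches) {
    const id = m.toUpperCase();
    if (ids.indexOf(id) === -1) ids.push(id);
  }
  return ids;
}

// An email that references at least one Order ID is a candidate for the ERP stream.
function hasOrderReference(text: string): boolean {
  return extractOrderIds(text).length > 0;
}
```

The word boundary at the start keeps strings such as "DEPO-12" from being read as an order reference.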
Non-Goals
- No automated replies to customers in the POC phase.
- No direct modification of the user's Outlook folder structure (POC only).
- No multi-language support (English only for now).
Technical Considerations
- LLM Choice: Gemini 1.5 Pro or GPT-4o for high-fidelity extraction.
- Persistence: Store all classification metadata in a JSONB field in the `emails` table.
- Latency: Keyword rules should execute in <100ms; LLM calls should target <3s.
- Deduplication: Match by email domain and phone number (regex) as secondary identifiers.
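The secondary identifiers for deduplication might be derived along these lines. The phone pattern is a deliberately loose assumption (optional "+", digits with common separators, 8 to 15 digits in total) rather than a validated format, and `emailDomain` and `extractPhones` are hypothetical helper names.

```typescript
// Lower-cased domain part of an address, or null if there is no usable "@".
function emailDomain(email: string): string | null {
  const at = email.lastIndexOf("@");
  if (at <= 0 || at === email.length - 1) return null;
  return email.slice(at + 1).trim().toLowerCase();
}

// Loose pattern: optional "+", then digits separated by spaces, dots, dashes, or parentheses.
const PHONE_PATTERN = /\+?\d[\d\s().-]{6,}\d/g;

// Returns distinct phone numbers reduced to digits only, so formatting differences do not matter.
function extractPhones(text: string): string[] {
  const raw = text.match(PHONE_PATTERN) || [];
  const phones: string[] = [];
  for (const r of raw) {
    const digits = r.replace(/\D/g, "");
    if (digits.length >= 8 && digits.length <= 15 && phones.indexOf(digits) === -1) {
      phones.push(digits);
    }
  }
  return phones;
}
```

Note that digit-only normalization keeps a number written with and without a country code as two different keys; reconciling those would need extra logic.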
Success Metrics
- >90% accuracy in CRM vs ERP routing.
- <5% "False Negative" rate (Missing a sales enquiry).
- Reduction in manual lead entry time by 80% through auto-scraping.
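The first two metrics could be computed from a hand-labelled sample set, as in the sketch below. It assumes "False Negative" means an email whose correct stream is CRM but which was routed elsewhere, measured against all emails that should have gone to CRM. That reading, and the names `LabeledSample` and `routingMetrics`, are assumptions.

```typescript
interface LabeledSample {
  expected: "CRM" | "ERP";  // stream assigned by a human reviewer
  predicted: "CRM" | "ERP"; // stream assigned by the engine
}

function routingMetrics(samples: LabeledSample[]): { accuracy: number; falseNegativeRate: number } {
  let correct = 0;
  let crmTotal = 0;
  let crmMissed = 0;
  for (const s of samples) {
    if (s.expected === s.predicted) correct++;
    if (s.expected === "CRM") {
      crmTotal++;
      if (s.predicted !== "CRM") crmMissed++;
    }
  }
  return {
    // Share of all samples routed to the expected stream.
    accuracy: samples.length === 0 ? 0 : correct / samples.length,
    // Share of true CRM emails that were not routed to CRM.
    falseNegativeRate: crmTotal === 0 ? 0 : crmMissed / crmTotal,
  };
}
```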
Open Questions
- Should we allow the user to manually "Correct" the AI's classification via the UI? (Planned for KBN-003).
- How do we handle emails that CC multiple entities?