
Observability & Operations Overview

This document details the monitoring, logging, and operational readiness strategy for Pebble Orchestrator (Pebble IQ).


Scope Breakdown (from 1,850h Estimate)

| Operational Component | Hours | Coverage |
|---|---|---|
| CI/CD & Security | 52h | GitHub Actions, Docker builds, secret management |
| Backups, Monitoring, DR | 52h | PostgreSQL backups, Redis snapshots, Prometheus/Grafana |
| Reporting & Audit | 72h | Activity Stream, compliance logs, executive dashboards |
| Total Operations | 176h | ~9.5% of total budget |

1. Audit Logging (Activity Stream)

What's Being Logged

Every business event is captured in the activity_stream table as an immutable event:

| Event Type | Payload Example | Retention |
|---|---|---|
| EmailReceived | {message_id, from, subject, stream} | 7 years (compliance) |
| LeadClassified | {ai_intent, confidence, manual_override} | 7 years |
| DraftApproved | {user_id, draft_id, edits_made} | 7 years |
| TallyLookup | {product, stock_returned, timestamp} | 1 year |
| CompanyValidated | {cin, pan, gstin, source} | 7 years |

Compliance Features

  • Who Did What: Every action tied to user_id or ai_agent_id
  • Tamper-Proof: Write-only table (no updates/deletes; enforced as sketched below)
  • Searchable: Indexed by entity (Company, Email, Lead) for audit queries
  • Export: CSV/Excel export for regulators
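
The write-only guarantee can be enforced below the application layer. A minimal sketch, assuming SQLAlchemy on PostgreSQL (column names mirror the event fields above but are illustrative, not the project's actual schema):

# Hypothetical append-only model for activity_stream (names illustrative).
from datetime import datetime, timezone

from sqlalchemy import BigInteger, Column, DateTime, String, event
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class ActivityEvent(Base):
    __tablename__ = "activity_stream"

    id = Column(BigInteger, primary_key=True)
    event_type = Column(String, nullable=False)    # e.g. "EmailReceived"
    entity_type = Column(String, nullable=False)   # "Company", "Email", "Lead"
    entity_id = Column(String, nullable=False, index=True)  # audit lookups
    actor_id = Column(String, nullable=False)      # user_id or ai_agent_id
    payload = Column(JSONB, nullable=False)
    created_at = Column(
        DateTime(timezone=True),
        default=lambda: datetime.now(timezone.utc),
    )

# Tamper-proofing at the ORM layer: reject any UPDATE/DELETE before flush.
@event.listens_for(Session, "before_flush")
def _reject_mutations(session, flush_context, instances):
    for obj in list(session.dirty) + list(session.deleted):
        if isinstance(obj, ActivityEvent):
            raise PermissionError("activity_stream is write-only")

A belt-and-braces version would also REVOKE UPDATE and DELETE on the table at the PostgreSQL level, so the guarantee holds even for direct database access.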

2. MLOps (AI Model Monitoring)

Current Scope (Included in 576h AI Budget)

| MLOps Component | Effort | Status |
|---|---|---|
| Training Pipeline | ~40h | Part of AI Intelligence (132h) |
| Inference Monitoring | ~30h | Part of AI Eval (100h) |
| Drift Detection | ~20h | Basic threshold alerts (sketched below) |
| Retraining | ~30h | Manual trigger for POC |
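
For the drift-detection line item, a threshold alert can be as simple as watching mean confidence over a rolling window. A sketch; the window size and threshold are assumptions, not project values:

# Hypothetical threshold-based drift check over recent confidence scores.
from collections import deque

class ConfidenceDriftMonitor:
    def __init__(self, window: int = 500, threshold: float = 0.75):
        self.scores: deque = deque(maxlen=window)
        self.threshold = threshold

    def record(self, confidence: float) -> bool:
        """Record one score; return True once mean confidence drifts low."""
        self.scores.append(confidence)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data to judge drift yet
        return sum(self.scores) / len(self.scores) < self.threshold

Phase 4 would replace this with a proper tool such as Evidently AI, as noted in the list below.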

What's NOT in MVP (Phase 4)

  • Automated drift detection (Evidently AI)
  • A/B testing infrastructure (multiple model versions)
  • Feature store (centralized entity embeddings)
  • Auto-retraining pipelines

Key MVP Metrics (Prometheus)

# Email Classification
pebble_ai_classification_latency_seconds
pebble_ai_classification_confidence_score
pebble_ai_classification_override_rate

# Tally Integration (Phase 1)
pebble_tally_lookup_success_rate
pebble_tally_lookup_latency_seconds

# LangGraph Agents (Phase 1)
pebble_draft_generation_latency_seconds
pebble_draft_approval_rate
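
A sketch of how these metrics could be registered with the prometheus_client library. The model call is hypothetical, and note that an override rate is usually computed in PromQL from a raw counter rather than exported directly:

# Hypothetical instrumentation sketch using prometheus_client.
from prometheus_client import Counter, Histogram, start_http_server

CLASSIFICATION_LATENCY = Histogram(
    "pebble_ai_classification_latency_seconds",
    "Time spent classifying one inbound email",
)
CLASSIFICATION_CONFIDENCE = Histogram(
    "pebble_ai_classification_confidence_score",
    "Model confidence per classification",
    buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0],
)
CLASSIFICATION_OVERRIDES = Counter(
    "pebble_ai_classification_override_total",
    "Manual overrides of AI classifications (rate derived in PromQL)",
)

def classify_email(email):
    with CLASSIFICATION_LATENCY.time():
        intent, confidence = run_classifier(email)  # hypothetical model call
    CLASSIFICATION_CONFIDENCE.observe(confidence)
    return intent, confidence

start_http_server(9100)  # exposes /metrics for Prometheus to scrape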

3. DevOps & Infrastructure

CI/CD Pipeline (52h)

Includes:

  • GitHub Actions workflows (test → build → deploy)
  • Docker multi-stage builds
  • Staging vs Production environments
  • Automated DB migrations (Alembic)
  • Secret rotation (Vault or GH Secrets)

Example Workflow:

# .github/workflows/ci.yml (sketch)
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest --cov --cov-fail-under=80
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t pebble-api:${{ github.sha }} .
  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    steps:
      - run: kubectl apply -f k8s/staging/

Monitoring & Alerting (52h)

Stack:

  • Metrics: Prometheus + Grafana
  • Logs: Loki (or ELK)
  • APM: Sentry (error tracking)
  • Uptime: UptimeRobot (external)

Key Dashboards:

  1. Email Pipeline: Emails ingested/hour, classification rate, error rate
  2. Tally Health (Phase 1): XML failures, stock lookup latency
  3. CRM Performance: API p95 latency, DB connection pool
  4. LangGraph Agents (Phase 1): Draft generation success rate, human override %

Disaster Recovery (Included in 52h)

| Asset | Backup Frequency | Recovery Time |
|---|---|---|
| PostgreSQL | Every 6 hours | < 1 hour |
| Redis | Daily snapshot | < 30 min |
| Email Attachments (MinIO) | Daily | < 2 hours |
| Activity Stream | Every 1 hour | < 30 min (critical) |
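
For illustration, the PostgreSQL row above could be served by a scheduled pg_dump job along these lines (DSN, output path, and scheduling are assumptions; a managed offering with point-in-time recovery, as noted in the summary, would supersede this):

# Illustrative 6-hourly PostgreSQL backup job; invoke from cron/systemd.
import datetime
import subprocess

def backup_postgres(dsn: str, out_dir: str = "/backups") -> str:
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
    target = f"{out_dir}/pebble_{stamp}.dump"
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={target}", dsn],
        check=True,  # raise (and alert) if the dump fails
    )
    return target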

4. Reporting & Executive Dashboards (72h)

Operational Dashboards

Sales Ops Dashboard

  • Emails processed today/week
  • AI classification accuracy (% overrides)
  • Lead conversion funnel
  • Staff response time (P50, P95)

Owner Dashboard (Phase 1 - Tally)

  • Draft approval rate by staff member
  • Most common customer queries
  • Tally sync health

Compliance Reports

Quarterly Audit Export

  • All AI decisions (classification + drafts)
  • Manual overrides with justification
  • Failed validations (CIN/PAN/GST)

GDPR/Data Subject Requests

  • Export all data for a given email/company
  • Delete request workflow (soft-delete with audit trail; see the sketch below)
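
A sketch of the delete-request workflow, reusing the hypothetical ActivityEvent model from the section 1 sketch so the deletion itself leaves an audit trail (model and field names are illustrative):

# Hypothetical GDPR delete-request handler: soft-delete plus audit event.
from datetime import datetime, timezone

def handle_delete_request(session, company, requested_by: str) -> None:
    company.deleted_at = datetime.now(timezone.utc)  # soft delete; row is kept
    session.add(ActivityEvent(                       # from the section 1 sketch
        event_type="DataDeletionRequested",
        entity_type="Company",
        entity_id=str(company.id),
        actor_id=requested_by,
        payload={"basis": "GDPR data subject request"},
    ))
    session.commit()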


5. Security & Access Control

Authentication (Included in CRM 184h)

  • JWT-based auth (see the token sketch below)
  • Role-based access (Admin, Sales, Ops, Read-Only)
  • Session timeout (30 min inactivity)
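
A minimal sketch of the token shape using PyJWT. Secret handling and claim names are assumptions, and the 30-minute inactivity timeout is approximated here with a fixed token lifetime that the API would refresh on activity:

# Hypothetical JWT issuance/verification with PyJWT.
import datetime
import jwt  # PyJWT

SECRET = "change-me"  # in practice, loaded from an env var (see hardening below)
ROLES = {"admin", "sales", "ops", "read_only"}

def issue_token(user_id: str, role: str) -> str:
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    now = datetime.datetime.now(datetime.timezone.utc)
    return jwt.encode(
        {"sub": user_id, "role": role, "iat": now,
         "exp": now + datetime.timedelta(minutes=30)},  # 30-min lifetime
        SECRET,
        algorithm="HS256",
    )

def verify_token(token: str) -> dict:
    return jwt.decode(token, SECRET, algorithms=["HS256"])  # raises if expired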

Security Hardening (52h Operations)

  • SQL injection prevention (parameterized queries)
  • CORS policy for CRM UI
  • Rate limiting (100 req/min per user; see the sketch below)
  • Secrets in env vars (never in code)
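
One way to implement the rate-limit bullet, assuming the Redis instance already in the stack (the fixed-window approach and key scheme are illustrative):

# Illustrative fixed-window rate limiter: 100 requests/min per user.
import time
import redis

r = redis.Redis()

def allow_request(user_id: str, limit: int = 100) -> bool:
    key = f"rl:{user_id}:{int(time.time() // 60)}"  # per-user, per-minute window
    count = r.incr(key)
    if count == 1:
        r.expire(key, 60)  # window key cleans itself up
    return count <= limit

Fixed windows allow brief bursts at window boundaries; a sliding window or token bucket is stricter if that matters in practice.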

Penetration Testing (Included in QA 240h)

  • OWASP Top 10 checklist
  • Third-party pentest (if budget allows)
  • Vulnerability scanning (Snyk/Trivy)

6. What's NOT Included (Future Enhancements)

| Feature | Phase | Reasoning |
|---|---|---|
| Kubernetes Orchestration | Post-MVP | Docker Compose sufficient for POC |
| Multi-Region Deployment | Phase 3 | Single India region for now |
| Advanced MLOps (Feature Store) | Phase 4 | Manual retraining OK for POC |
| Real-Time Anomaly Detection | Phase 4 | Threshold alerts sufficient |
| Chaos Engineering | Post-MVP | Not critical for 8-week POC |

7. Cost Monitoring (Not in Scope)

Note: The 1,850h estimate does NOT include ongoing operational costs:

  • Cloud hosting (AWS/Azure/GCP)
  • LLM API costs (Azure OpenAI tokens)
  • Third-party SaaS (Sentry, monitoring tools)

Estimated Monthly Run Cost (Phase 1):

  • Compute: $200-500/month (2-3 VMs)
  • Database: $100/month (managed PostgreSQL)
  • LLM API: $50-200/month (depends on email volume)
  • Total: ~$350-800/month for the items above; budgeting ~$500-1,000/month leaves headroom for the third-party SaaS tools noted earlier


Summary: Is 176h Enough for Ops?

Yes for MVP, because:

  • We're not building Kubernetes (Docker Compose is OK)
  • The Activity Stream is a simple write-only table
  • Prometheus + Grafana are OSS (low setup time)
  • Disaster recovery is PostgreSQL point-in-time recovery

But Phase 1 (Tally) may need +20-30h for:

  • Tally sync monitoring dashboard
  • Draft approval analytics
  • Staff performance tracking (who's using AI drafts?)

