# Observability & Operations Overview
This document details the monitoring, logging, and operational readiness strategy for Pebble Orchestrator (Pebble IQ).
## Scope Breakdown (from 1,850h Estimate)
| Operational Component | Hours | Coverage |
|---|---|---|
| CI/CD & Security | 52h | GitHub Actions, Docker builds, secret management |
| Backups, Monitoring, DR | 52h | PostgreSQL backups, Redis snapshots, Prometheus/Grafana |
| Reporting & Audit | 72h | Activity Stream, compliance logs, executive dashboards |
| Total Operations | 176h | ~9.5% of total budget |
## 1. Audit Logging (Activity Stream)

### What's Being Logged

Every business event is captured in the `activity_stream` table as an immutable event:
| Event Type | Payload Example | Retention |
|---|---|---|
| `EmailReceived` | `{message_id, from, subject, stream}` | 7 years (compliance) |
| `LeadClassified` | `{ai_intent, confidence, manual_override}` | 7 years |
| `DraftApproved` | `{user_id, draft_id, edits_made}` | 7 years |
| `TallyLookup` | `{product, stock_returned, timestamp}` | 1 year |
| `CompanyValidated` | `{cin, pan, gstin, source}` | 7 years |
### Compliance Features
- **Who Did What**: Every action is tied to a `user_id` or `ai_agent_id`
- **Tamper-Proof**: Append-only table, no updates or deletes (see the model sketch below)
- **Searchable**: Indexed by entity (Company, Email, Lead) for audit queries
- **Export**: CSV/Excel export for regulators
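For reference, here is a minimal sketch of what the `activity_stream` model could look like with SQLAlchemy (Alembic, used elsewhere in the stack, would own the migration). Column names beyond the payload examples above are assumptions; tamper-proofing itself would be enforced at the database level by granting the application role INSERT/SELECT only.

```python
# Minimal sketch of the append-only activity_stream model (SQLAlchemy).
# Column names beyond the payload examples above are assumptions.
from datetime import datetime, timezone

from sqlalchemy import JSON, Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ActivityStreamEvent(Base):
    __tablename__ = "activity_stream"

    id = Column(Integer, primary_key=True)
    event_type = Column(String, nullable=False, index=True)   # e.g. "EmailReceived"
    entity_type = Column(String, nullable=False, index=True)  # Company / Email / Lead
    entity_id = Column(String, nullable=False, index=True)    # audit queries by entity
    actor_id = Column(String, nullable=False)                 # user_id or ai_agent_id
    payload = Column(JSON, nullable=False)                    # event-specific fields
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
```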
## 2. MLOps (AI Model Monitoring)

### Current Scope (Included in 576h AI Budget)
| MLOps Component | Effort | Status |
|---|---|---|
| Training Pipeline | ~40h | Part of AI Intelligence (132h) |
| Inference Monitoring | ~30h | Part of AI Eval (100h) |
| Drift Detection | ~20h | Basic threshold alerts |
| Retraining | ~30h | Manual trigger for POC |
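To make the "basic threshold alerts" row concrete, a check like the one below could run on a schedule over recent `LeadClassified` events. The field names mirror the payload example in section 1; the thresholds are illustrative assumptions.

```python
# Illustrative threshold-based drift check; thresholds are assumptions.
def check_classification_drift(recent_events: list[dict],
                               min_confidence: float = 0.75,
                               max_override_rate: float = 0.20) -> list[str]:
    """Return alert messages when confidence or override rate crosses a threshold."""
    alerts: list[str] = []
    if not recent_events:
        return alerts

    avg_conf = sum(e["confidence"] for e in recent_events) / len(recent_events)
    override_rate = sum(1 for e in recent_events if e["manual_override"]) / len(recent_events)

    if avg_conf < min_confidence:
        alerts.append(f"Avg classification confidence {avg_conf:.2f} < {min_confidence}")
    if override_rate > max_override_rate:
        alerts.append(f"Manual override rate {override_rate:.0%} > {max_override_rate:.0%}")
    return alerts
```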
### What's NOT in MVP (Phase 4)
- Automated drift detection (Evidently AI)
- A/B testing infrastructure (multiple model versions)
- Feature store (centralized entity embeddings)
- Auto-retraining pipelines
### Key MVP Metrics (Prometheus)

```
# Email Classification
pebble_ai_classification_latency_seconds
pebble_ai_classification_confidence_score
pebble_ai_classification_override_rate

# Tally Integration (Phase 1)
pebble_tally_lookup_success_rate
pebble_tally_lookup_latency_seconds

# LangGraph Agents (Phase 1)
pebble_draft_generation_latency_seconds
pebble_draft_approval_rate
```
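A hedged sketch of wiring up the classification metrics with the `prometheus_client` library. `classify_email()` and the counter name are placeholders; in practice the override *rate* would typically be computed in PromQL from a labelled counter rather than exported directly.

```python
# Sketch: instrumenting email classification with prometheus_client.
from prometheus_client import Counter, Histogram, start_http_server

CLASSIFICATION_LATENCY = Histogram(
    "pebble_ai_classification_latency_seconds",
    "Time spent classifying one inbound email",
)
CLASSIFICATIONS = Counter(
    "pebble_ai_classifications_total",  # placeholder name
    "Email classifications performed",
    ["overridden"],  # override rate = overridden="True" / total
)

def classify_email(email: dict) -> dict:
    # Placeholder for the real LLM classifier call.
    return {"ai_intent": "stock_enquiry", "confidence": 0.92, "manual_override": False}

def classify_with_metrics(email: dict) -> dict:
    with CLASSIFICATION_LATENCY.time():  # observes elapsed seconds on exit
        result = classify_email(email)
    CLASSIFICATIONS.labels(overridden=str(result["manual_override"])).inc()
    return result

start_http_server(9100)  # exposes /metrics for Prometheus to scrape
```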
## 3. DevOps & Infrastructure

### CI/CD Pipeline (52h)

Includes:

- GitHub Actions workflows (test → build → deploy)
- Docker multi-stage builds
- Staging vs Production environments
- Automated DB migrations (Alembic)
- Secret rotation (Vault or GH Secrets)

**Example Workflow:**
```yaml
# .github/workflows/ci.yml (abridged sketch)
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest --cov --cov-fail-under=80  # fail the build below 80% coverage
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t pebble-api:${{ github.sha }} .
  deploy-staging:
    if: github.ref == 'refs/heads/develop'
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: kubectl apply -f k8s/staging/
```
### Monitoring & Alerting (52h)

**Stack:**

- **Metrics**: Prometheus + Grafana
- **Logs**: Loki (or ELK)
- **APM**: Sentry (error tracking; see the snippet below)
- **Uptime**: UptimeRobot (external)

**Key Dashboards:**

1. **Email Pipeline**: Emails ingested/hour, classification rate, error rate
2. **Tally Health (Phase 1)**: XML failures, stock lookup latency
3. **CRM Performance**: API p95 latency, DB connection pool
4. **LangGraph Agents (Phase 1)**: Draft generation success rate, human override %
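For the APM piece, Sentry initialization is a few lines at process startup; the DSN and sample rate below are placeholders.

```python
# Minimal Sentry setup for the API process; DSN and sample rate are placeholders.
import sentry_sdk

sentry_sdk.init(
    dsn="https://<key>@<org>.ingest.sentry.io/<project>",  # placeholder DSN
    environment="staging",
    traces_sample_rate=0.1,  # sample 10% of transactions to keep APM cheap
)
```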
### Disaster Recovery (Included in 52h)
| Asset | Backup Frequency | Recovery Time |
|---|---|---|
| PostgreSQL | Every 6 hours | < 1 hour |
| Redis | Daily snapshot | < 30 min |
| Email Attachments (MinIO) | Daily | < 2 hours |
| Activity Stream | Every 1 hour | < 30 min (critical) |
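A sketch of what the 6-hourly PostgreSQL backup job might look like, pushing dumps into the same MinIO deployment that stores email attachments. The endpoint, bucket, and credentials are placeholder assumptions, and scheduling would live in cron; this complements, rather than replaces, WAL-based point-in-time recovery.

```python
# Hedged sketch of the 6-hourly PostgreSQL backup job; endpoint, bucket,
# and credentials are placeholders.
import subprocess
from datetime import datetime, timezone

from minio import Minio

def backup_postgres(db_url: str, bucket: str = "pebble-backups") -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/pebble_{stamp}.dump"

    # Custom-format dump so pg_restore can do selective restores later.
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={dump_path}", db_url],
        check=True,
    )

    client = Minio("minio.internal:9000", access_key="...", secret_key="...",
                   secure=False)  # placeholder internal endpoint
    client.fput_object(bucket, f"postgres/{stamp}.dump", dump_path)
    return dump_path
```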
## 4. Reporting & Executive Dashboards (72h)

### Operational Dashboards

**Sales Ops Dashboard**

- Emails processed today/week
- AI classification accuracy (% overrides)
- Lead conversion funnel
- Staff response time (P50, P95)

**Owner Dashboard (Phase 1 - Tally)**

- Draft approval rate by staff member
- Most common customer queries
- Tally sync health
### Compliance Reports

**Quarterly Audit Export**

- All AI decisions (classification + drafts)
- Manual overrides with justification
- Failed validations (CIN/PAN/GST)

**GDPR/Data Subject Requests**

- Export all data for a given email/company (see the sketch below)
- Delete request workflow (soft-delete with audit trail)
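A data-subject export can be a straightforward query over the Activity Stream, as in the sketch below; the table and column names follow the model sketch in section 1 and are assumptions.

```python
# Sketch: export every Activity Stream event for one entity to CSV.
# Table/column names follow the earlier model sketch (assumptions).
import csv
import json

from sqlalchemy import create_engine, text

def export_entity_events(db_url: str, entity_id: str, out_path: str) -> None:
    engine = create_engine(db_url)
    query = text(
        "SELECT event_type, actor_id, payload, created_at "
        "FROM activity_stream WHERE entity_id = :eid ORDER BY created_at"
    )
    with engine.connect() as conn, open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["event_type", "actor_id", "payload", "created_at"])
        for row in conn.execute(query, {"eid": entity_id}):
            writer.writerow([row.event_type, row.actor_id,
                             json.dumps(row.payload), row.created_at])
```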
## 5. Security & Access Control

### Authentication (Included in CRM 184h)

- JWT-based auth (sketched after this list)
- Role-based access (Admin, Sales, Ops, Read-Only)
- Session timeout (30 min inactivity)
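A minimal JWT issue/verify sketch with PyJWT. The 30-minute `exp` claim approximates the inactivity timeout (true inactivity handling needs refresh logic), and the secret source is an assumption.

```python
# Minimal JWT sketch with PyJWT; secret handling is a placeholder.
from datetime import datetime, timedelta, timezone

import jwt  # pip install PyJWT

SECRET = "load-me-from-the-environment"  # placeholder; never hard-code
ALGO = "HS256"

def issue_token(user_id: str, role: str) -> str:
    claims = {
        "sub": user_id,
        "role": role,  # Admin / Sales / Ops / Read-Only
        "exp": datetime.now(timezone.utc) + timedelta(minutes=30),
    }
    return jwt.encode(claims, SECRET, algorithm=ALGO)

def verify_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError once the 30-minute window lapses.
    return jwt.decode(token, SECRET, algorithms=[ALGO])
```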
### Security Hardening (52h Operations)
- SQL injection prevention (parameterized queries)
- CORS policy for CRM UI
- Rate limiting (100 req/min per user; sketched after this list)
- Secrets in env vars (never in code)
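The 100 req/min limit maps naturally onto a fixed-window counter in Redis, which is already part of the stack; a sketch, with the key naming as an assumption:

```python
# Fixed-window rate limiter sketch backed by Redis; key naming is an assumption.
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    key = f"ratelimit:{user_id}"
    count = r.incr(key)          # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, window_s)  # start the 60-second window on first hit
    return count <= limit
```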
### Penetration Testing (Included in QA 240h)
- OWASP Top 10 checklist
- Third-party pentest (if budget allows)
- Vulnerability scanning (Snyk/Trivy)
## 6. What's NOT Included (Future Enhancements)
| Feature | Phase | Reasoning |
|---|---|---|
| Kubernetes Orchestration | Post-MVP | Docker Compose sufficient for POC |
| Multi-Region Deployment | Phase 3 | Single India region for now |
| Advanced MLOps (Feature Store) | Phase 4 | Manual retraining OK for POC |
| Real-Time Anomaly Detection | Phase 4 | Threshold alerts sufficient |
| Chaos Engineering | Post-MVP | Not critical for 8-week POC |
## 7. Cost Monitoring (Not in Scope)

**Note**: The 1,850h estimate does NOT include ongoing operational costs:

- Cloud hosting (AWS/Azure/GCP)
- LLM API costs (Azure OpenAI tokens)
- Third-party SaaS (Sentry, monitoring tools)

**Estimated Monthly Run Cost (Phase 1):**

- Compute: $200-500/month (2-3 VMs)
- Database: $100/month (managed PostgreSQL)
- LLM API: $50-200/month (depends on email volume)
- **Total: ~$350-800/month** (sum of the component ranges)
## Summary: Is 176h Enough for Ops?

**Yes for the MVP**, because:

- We're not building Kubernetes (Docker Compose is sufficient for the POC)
- The Activity Stream is a simple append-only table
- Prometheus + Grafana are OSS (low setup time)
- Disaster recovery is PostgreSQL point-in-time recovery

**But Phase 1 (Tally) may need +20-30h** for:

- Tally sync monitoring dashboard
- Draft approval analytics
- Staff performance tracking (who's using AI drafts?)