From Outage Chaos to Agentic Ops: AI Diagnosis for Cloud-Native DBaaS

Managing 1,000+ databases across Kubernetes means your on-call engineer is buried in runbooks, Grafana dashboards, and kubectl tabs at 2 AM. Operating diverse, stateful workloads like MongoDB, PostgreSQL, and Valkey at scale is notoriously difficult: each relies on its own operator, distinct CRDs, and its own failure-mode fingerprints.

When a production outage strikes, traditional dashboards often fall short. Triage requires an engineer to manually correlate a mountain of disconnected data: Kubernetes event streams, pod states, PVC health, operator controller logs, and real-time Prometheus metrics. Hunting across these silos under pressure increases Mean Time to Resolution (MTTR) and drives burnout. To solve this, we built a read-only, safety-first AI Triage Agent.

This talk shares our journey of building this agent using LangGraph, FastMCP, and the Kubernetes Python client. We'll dive into our architecture of "Skill Runners": Python-based logic that encodes playbook knowledge so the agent can autonomously run PromQL queries and inspect CRDs. We'll cover how we reduced LLM token overhead by ~1200 tokens per call, the security benefits of enforcing read-only constraints in code, and the real production failures we encountered along the way.
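To make the idea of enforcing read-only constraints in code concrete, here is a minimal sketch in plain Python. It is not the agent described in the talk: the names `SkillRegistry`, `Skill`, `ReadOnlyViolation`, and `pvc_health` are hypothetical, the allowlist of verbs is an assumption, and the example skill works on stubbed input instead of calling a real cluster. The point it illustrates is that a skill declaring any mutating verb is rejected at registration time, before the LLM can ever invoke it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable

# Assumption: only these Kubernetes API verbs are treated as read-only.
READ_ONLY_VERBS: FrozenSet[str] = frozenset({"get", "list", "watch"})


class ReadOnlyViolation(Exception):
    """Raised when a skill requests a verb outside the read-only allowlist."""


@dataclass(frozen=True)
class Skill:
    name: str
    verbs: FrozenSet[str]
    run: Callable[[dict], dict]


class SkillRegistry:
    """Hypothetical registry: the agent can only call skills stored here."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, name: str, verbs: Iterable[str],
                 run: Callable[[dict], dict]) -> None:
        requested = frozenset(v.lower() for v in verbs)
        forbidden = requested - READ_ONLY_VERBS
        if forbidden:
            # Fail closed: a skill that declares a mutating verb never loads.
            raise ReadOnlyViolation(
                f"skill {name!r} requests non-read verbs: {sorted(forbidden)}"
            )
        self._skills[name] = Skill(name, requested, run)

    def invoke(self, name: str, args: dict) -> dict:
        return self._skills[name].run(args)


def pvc_health(args: dict) -> dict:
    # Hypothetical skill body. A real one would read PVC status from the
    # cluster; here the phase is passed in so the sketch runs anywhere.
    phase = args.get("phase", "Unknown")
    return {"pvc": args["pvc"], "healthy": phase == "Bound"}


registry = SkillRegistry()
registry.register("pvc_health", {"get", "list"}, pvc_health)
result = registry.invoke("pvc_health", {"pvc": "data-mongo-0", "phase": "Bound"})
```

An allowlist checked in code is only one layer. In a real deployment it would normally sit alongside a read-only Kubernetes RBAC role for the agent's service account, so that a bug in the Python check cannot by itself grant write access.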

Vignesh Muthu.S

Engineering Leader focused on building scalable, self-service database ecosystems and enabling Comcast's digital transformation.

Chennai, India
