Session
Resilience for Large-Scale Kubernetes Deployments: Intuit's AI-Driven Reliability Agents
Managing reliability in a large-scale, cloud-native ecosystem with thousands of microservices across hundreds of Kubernetes clusters (345+) is no small feat. At Intuit, traditional methods like Failure Mode and Effects Analysis (FMEA) were too manual, slow, and inconsistent for a fast-evolving platform. Failures are inevitable, and the stakes, ranging from revenue loss to brand impact, are too high to rely on manual analysis alone.
To address this, Intuit developed Agentic Reliability Engineering, an AI-powered framework that acts as a built-in reliability expert. Using an LLM orchestrated through LangChain, knowledge graphs, and service dependency data, it automates the generation of focused FMEA templates and resilience patterns tailored to specific services and workflows. This self-service approach lets teams identify risks and design for resilience without manual overhead. The Reliability Agent shifts engineering teams from reactive postmortems to proactive, design-time reliability.
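To make the idea of generating a focused FMEA template from service dependency data more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Intuit's implementation: the dependency map, the pattern catalog, the placeholder scores, and every name (`DEPENDENCIES`, `PATTERNS`, `FmeaRow`, `draft_fmea`) are hypothetical, and the LLM and knowledge-graph steps are left out entirely. The only standard element is the risk priority number (RPN), which FMEA conventionally computes as severity × occurrence × detection.

```python
from dataclasses import dataclass

# Hypothetical service dependency data: service -> direct downstream dependencies.
DEPENDENCIES = {
    "checkout": ["payments", "inventory"],
    "payments": ["ledger-db"],
    "inventory": [],
    "ledger-db": [],
}

# Hypothetical catalog: dependency kind -> (failure mode, suggested resilience pattern).
PATTERNS = {
    "db": ("datastore unavailable", "read replica failover with retries"),
    "service": ("downstream timeout", "circuit breaker with fallback"),
}


@dataclass
class FmeaRow:
    service: str
    dependency: str
    failure_mode: str
    mitigation: str
    severity: int    # 1-10, placeholder value in this sketch
    occurrence: int  # 1-10, placeholder value in this sketch
    detection: int   # 1-10, placeholder value in this sketch

    @property
    def rpn(self) -> int:
        # Conventional FMEA risk priority number.
        return self.severity * self.occurrence * self.detection


def transitive_dependencies(service, graph):
    """Return every dependency reachable from `service`, depth-first."""
    seen = []
    stack = list(reversed(graph.get(service, [])))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.append(dep)
            stack.extend(reversed(graph.get(dep, [])))
    return seen


def draft_fmea(service, graph):
    """Draft one FMEA row per reachable dependency, highest RPN first."""
    rows = []
    for dep in transitive_dependencies(service, graph):
        # Assumption: a "-db" suffix marks a datastore in this toy naming scheme.
        kind = "db" if dep.endswith("-db") else "service"
        failure_mode, mitigation = PATTERNS[kind]
        severity = 8 if kind == "db" else 6  # placeholder scores
        rows.append(FmeaRow(service, dep, failure_mode, mitigation, severity, 3, 4))
    return sorted(rows, key=lambda r: r.rpn, reverse=True)


if __name__ == "__main__":
    for row in draft_fmea("checkout", DEPENDENCIES):
        print(f"{row.dependency}: {row.failure_mode} -> {row.mitigation} (RPN {row.rpn})")
```

In a setup like the one the session describes, the hard-coded catalog and scores would presumably be replaced by output from the LLM and knowledge graph; the sketch only shows how dependency data can scope the template to one service.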
Deepthi Panthula
Senior Staff Product Manager
San Jose, California, United States