Deepthi Panthula
Senior Staff Product Manager
San Jose, California, United States
Deepthi Panthula is a Senior Staff Product Manager for Reliability Engineering at Intuit, a CNCF end user company. She drives the vision and product strategy for Intuit’s fault-tolerant disaster recovery, chaos engineering, performance testing, and AI-driven resilience platforms, critical capabilities that ensure the reliability and scalability of thousands of cloud-native services. She brings a deep product-first perspective to the evolving world of cloud-native reliability.
Topics
Proactive Resilience: Leveraging Generative AI for Chaos and FMEA Management
In today’s rapidly evolving digital landscape, proactively identifying chaos and Failure Mode and Effects Analysis (FMEA) scenarios is critical for maintaining resilient systems. In this talk, we will delve into how generative AI can significantly enhance risk management and incident recovery efforts. By integrating diverse data sources—such as dependency graphs, historical incident reports, observability metrics, anomaly detection, and performance logs—organizations can uncover vulnerabilities and derive actionable insights before potential issues escalate.
Generative AI plays a pivotal role in this framework by highlighting and automating the identification of risk scenarios. By leveraging its inferential capabilities, AI not only surfaces hidden weaknesses but also triggers alerts and automates response protocols, keeping teams informed so they can focus their expertise on critical decision-making. We will present case studies from Intuit to showcase how these AI-driven strategies empower teams to build proactive incident response mechanisms and bolster overall system resilience.
Join us to explore how harnessing the power of generative AI can illuminate potential risks, automate routine assessments, and foster a culture of vigilance and adaptability within your organization. Discover how to use AI as a powerful ally in creating a more secure and reliable operational environment!
Resilience for Large-Scale Kubernetes Deployments: Intuit's AI-Driven Reliability Agents
Managing reliability in a large-scale, cloud-native ecosystem with thousands of microservices across hundreds of Kubernetes clusters (345+) is no small feat. At Intuit, traditional methods like Failure Mode and Effects Analysis (FMEA) were too manual, slow, and inconsistent for a fast-evolving platform. Failures are inevitable, and the stakes, ranging from revenue loss to brand impact, are too high to rely on manual analysis alone.
To address this, Intuit developed Agentic Reliability Engineering, an AI-powered framework that acts as a built-in reliability expert. Leveraging a LangChain-based LLM, knowledge graphs, and service dependency data, it automates the generation of focused FMEA templates and resilience patterns tailored to specific services and workflows. This self-service approach empowers teams to identify risks and design for resilience without manual overhead. The Reliability Agent shifts engineering teams from reactive postmortems to proactive, design-time reliability.
Engineering for Outages: Intuit’s Scalable and Developer-Centric Disaster Recovery Platform
What does it take to make regional failover developer-friendly, scalable, and routine? At Intuit, where 2500+ services run on 345+ Kubernetes clusters, disaster recovery (DR) evolved from a compliance checkbox to a complex product challenge. Manual processes and scattered scripts didn’t scale—so we built EWOK (Ecosystem Wide Orchestrator Kit): a self-service DR automation platform for engineers, not just SREs.
This talk covers how we designed EWOK to support multi-region resilience using open source tools, state machines, declarative orchestration, group failover logic, observability hooks, and governance workflows. We’ll show how it reduces MTTR, empowers teams, and makes DR an integrated part of the development lifecycle. Whether you're starting your resilience journey or managing DR at scale, this session delivers tactical insights and proven patterns for your Kubernetes environment.
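As a rough illustration of the state-machine and group failover ideas mentioned above, here is a minimal sketch. It is not EWOK's implementation: the state names, the `advance` and `fail_over_group` helpers, and the service names are all hypothetical, chosen only to show how a group of services might be stepped through a failover together.

```python
from enum import Enum


class State(Enum):
    # Hypothetical failover states; the source does not describe EWOK's actual states.
    ACTIVE = "active"
    DRAINING = "draining"
    FAILED_OVER = "failed_over"


# Allowed forward transitions for one service during a regional failover (assumed).
TRANSITIONS = {
    State.ACTIVE: State.DRAINING,
    State.DRAINING: State.FAILED_OVER,
}


def advance(state: State) -> State:
    """Move a service one step forward; FAILED_OVER is terminal and stays put."""
    return TRANSITIONS.get(state, state)


def fail_over_group(services: list[str]) -> dict[str, State]:
    """Step every service in a group through the state machine in lockstep."""
    states = {name: State.ACTIVE for name in services}
    while any(s is not State.FAILED_OVER for s in states.values()):
        states = {name: advance(s) for name, s in states.items()}
    return states


if __name__ == "__main__":
    # Example group of two made-up services.
    print(fail_over_group(["payments", "ledger"]))
```

Keeping the transitions in a declarative table rather than in branching code is one way to make each step observable and auditable, which is the kind of property a governance workflow would likely want.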
Building a Culture of Continuous Resiliency
Failures are inevitable, and well-architected distributed systems are no exception. Any outage or turbulence in production not only impacts revenue but also damages your company’s brand and reputation. With the increasing complexity of microservice architectures, it is important to ensure that products and platforms are reliable and to validate them proactively before a real incident occurs. A delightful and uninterrupted experience for end users is a must.
In this session, we will share how Intuit, with thousands of services across 200+ clusters, validates resiliency at scale by leveraging company-wide Game Day events as well as continuous integration pipelines. We will demonstrate our adoption of open source chaos engineering capabilities using LitmusChaos, a Cloud Native Computing Foundation (CNCF) project, integrated with Argo, observability, and Intuit’s remediation tools. Come hear about our learnings and journey so you can apply the same principles and patterns within your organization to help release reliable products with confidence.
Benefits to the Ecosystem:
At Intuit, we have several flagship products (TurboTax, QuickBooks, Mint, Credit Karma, Mailchimp) that serve millions of customers. Our mission is powering prosperity around the world, and we make sure every incident goes through detailed root cause analysis. Along the way, we have learned that we must ensure our products are reliable and can withstand any turbulence.
In this talk, we will share insights on how Intuit paved its path to company-wide mandatory Game Days and how other teams can apply similar processes and automations to achieve continuous resilience at scale. We will demonstrate how teams can use our integration patterns built on various CNCF projects (LitmusChaos, Argo Workflows, Argo CD, Argo ApplicationSets) to execute chaos experiments either within a continuous pipeline or ad hoc during Game Days. We will share how we enabled thousands of Intuit developers to think about resiliency as part of design and implementation, along with our learnings on how to execute company-wide Game Days in a controlled manner to reduce outages in production.
Automating Resiliency at Scale
Failures are inevitable! With increasing complexity and dependencies in the microservices world, it is impossible to avoid failures, but one can prepare for them by building resilient systems. These systems should recover proactively, with appropriate monitoring and alerts, and provide delightful and uninterrupted experiences to end users with fewer outages and disruptions.
In this session, we will share how Intuit, with thousands of services across hundreds of Kubernetes clusters, automated resiliency at scale with a simplified, self-serve experience via a continuous integration pipeline. By leveraging LitmusChaos (an open source, cloud-native chaos engineering framework) and integrating it with Argo tools (Argo CD, Argo Workflows, Argo ApplicationSets), we achieved higher developer productivity, enabling thousands of developers across the organization to build and ship reliable products. We will also share our learnings and journey on how this approach paved the way for Intuit-wide Game Days, so the same principles and patterns can be applied within your organization to gain more confidence in executing ad hoc chaos testing in production.
DeveloperWeek CloudX 2023 Sessionize Event
DeveloperWeek Cloud 2022 Sessionize Event