Chaos & Behind the Scenes of Kubernetes in Production Blunders

In fast-moving AI and cloud-native deployments, chaos is often an uninvited guest. At Tune AI, we ran into significant challenges while building and deploying a developer platform for Large Language Models (LLMs). This talk walks through the incidents we faced while running that platform on Kubernetes in production, and how we resolved them. It highlights three critical production incidents that led to downtime; they involved manually updating Kubernetes deployments, handling resource deletions, and managing kubectl contexts. The talk also covers how we lost the production Terraform state because of an incorrect Terraform backend configuration, and how we recovered it.

Attendees will learn best practices for managing Kubernetes deployments and avoiding common pitfalls, effective strategies for managing resources and kubectl contexts, and ways to keep Terraform configurations robust enough to prevent production downtime. Using our journey at Tune AI as a case study, the session shares practical insights and solutions to help attendees handle similar challenges, and illustrates how these practices can improve deployment stability and efficiency.

Rohit Ghumare

CNCF Ambassador

London, United Kingdom

