Pavneet Ahluwalia
Product @ Microsoft Azure
Toronto, Canada
Pavneet is a Staff Product Manager at Together AI, leading IaaS and GPU cluster products for AI-native startups. Previously, he was a product lead at Azure Kubernetes Service, focusing on scale and performance. He has 7 years of experience working with containers and cloud platforms. Prior to that, he worked in digital marketing and data science and co-founded 2 startups with successful exits.
From Chaos to Clarity: Building an Enterprise-Scale MCP Server for Kubernetes Troubleshooting
How we built an MCP server to power AI-driven Kubernetes troubleshooting at enterprise scale — balancing security, observability, and tool sprawl in real production environments!
As Kubernetes platforms scale, management becomes the hardest problem. The Model Context Protocol (MCP) offers a powerful way to connect AI agents to real operational systems.
The server acts as a control plane for AI-driven diagnostics, dynamically querying cluster APIs, metrics stores, and logs, and installing tools on demand to answer “what’s broken and why?” in real time.
We’ll cover:
- MCP server architecture for K8s troubleshooting, including multi-cluster access, tool isolation, and safe execution patterns
- Hard-earned lessons on tool sprawl — how adding more tools degraded latency and reliability, and how we simplified everything into generic, composable CLI-based tools
- Security evolution: moving from OAuth-based auth to managed identity and enterprise RBAC
Beyond ChatOps: Agentic AI in Kubernetes—What Works, What Breaks, and What’s Next
Agentic AI is evolving from hype to hands-on reality—no longer just copilots, but autonomous actors in Kubernetes clusters. But how effective are these AI agents in real-world ops?
This panel brings together builders and operators who've deployed LLM-powered agents at scale in production to share what worked, what broke, and what surprised them. Expect a candid, high-signal conversation on the true strengths and sharp limitations of AI agents for Kubernetes.
SREs and platform engineers/operators: come with questions, leave with a clearer sense of where AI can reduce toil, when it still needs babysitting (human-in-the-loop), and how to experiment and deploy safely.
We’ll cover:
- High-efficacy use cases: RCA, triage, incident summarization
- Common failure patterns: hallucinations, context loss, unpredictability, alert attention
- Evaluation strategies in dynamic prod environments
- Design trends: agent chaining, feedback loops, safety guardrails
CLI Agent for AKS: AI-Powered Troubleshooting from Your Terminal
The CLI Agent for AKS brings an AI troubleshooting loop directly into az, so operators can ask natural-language questions (e.g., “why is my pod Pending?”) and get grounded reasoning, targeted diagnostics, and safe, human-in-the-loop remediation steps without leaving the terminal. It’s built on open source (HolmesGPT + AKS-MCP), runs locally with your Azure RBAC, and is extensible via runbooks and MCP toolsets. In this demo, we will walk through how to get started with the CLI Agent for AKS and the AKS MCP server, troubleshoot some complex networking issues using "az aks agent", and share how you can get involved in the community.
Why this matters (3 takeaways)
- Cut MTTR: Turn scattered logs/metrics into concise RCA with actionable next steps.
- Secure by design: Local execution + Azure CLI auth; no cluster changes without explicit approval.
- Composable & open: Plug in your AI provider, observability stack, and MCP tools/runbooks as “lego blocks”, enabling the community to contribute.