Session
SRE Playbook for LLMs and AI Agents: Observability, Scaling, and Reliability
AI agents are hitting production faster than SRE practices can keep up. Traditional RED (rate, errors, duration) metrics don't capture reasoning loops, context window exhaustion, or LLM provider throttling, and your existing HPA won't scale what it can't measure. This talk delivers the playbook SREs need: which metrics to monitor for LLMs and AI agents, how to define meaningful SLOs for non-deterministic workloads, and how to autoscale agent workers using KEDA with custom Prometheus metrics. The playbook draws on real-world experience operating AI workloads on Kubernetes at scale.
Anuj Tyagi
Senior Site Reliability Engineer - AI
Middletown, Delaware, United States