SRE Playbook for LLMs and AI Agents: Observability, Scaling, and Reliability

AI agents are reaching production faster than SRE practices can keep up. Traditional RED metrics (rate, errors, duration) don't capture reasoning loops, context window exhaustion, or LLM provider throttling, and your existing HPA can't scale on what it can't measure. This talk delivers the playbook SREs need: which metrics to monitor for LLMs and AI agents, how to define meaningful SLOs for non-deterministic workloads, and how to autoscale agent workers using KEDA with custom Prometheus metrics. It draws on real-world experience operating AI workloads on Kubernetes at scale.

Anuj Tyagi

Senior Site Reliability Engineer - AI

Middletown, Delaware, United States
