Session
OpsAI: Incident Investigation, Reimagined with AI Agents
Every incident follows the same pattern. Alerts fire, you open four terminals, correlate logs with recent deployments, check cluster state, dig through git history, and slowly piece together what went wrong. The tools are good. The process is exhausting.
OpsAI is a multi-agent AI system we built to tackle this. It investigates incidents by pulling evidence from logs, Kubernetes state, and git repositories, then produces answers where every claim is tied to a real source. No hallucinated pod names. No invented timelines. Every assertion points back to a log line, a commit, or a cluster object.
This talk covers what we learned building it in production: why evidence citation has to be an architectural constraint and not an afterthought, how we structured git, Loki, and Kubernetes snapshots as complementary evidence layers, how multi-agent coordination works when sub-questions need different specialists, and why running AI inference workloads on Kubernetes is a different class of operational problem than most teams expect.
The goal is to share a concrete architecture pattern that others can apply, and to be honest about where it breaks down.
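To make "evidence citation as an architectural constraint" concrete, here is a minimal sketch of one way such a rule could be enforced in the data model itself, so that an uncited claim cannot be constructed at all. It is not OpsAI's actual code: the names `SourceKind`, `Evidence`, `Claim`, and `render` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class SourceKind(Enum):
    # The three evidence layers named in the abstract (hypothetical labels).
    LOG_LINE = "log_line"
    COMMIT = "commit"
    CLUSTER_OBJECT = "cluster_object"


@dataclass(frozen=True)
class Evidence:
    kind: SourceKind
    ref: str      # assumed identifier, e.g. a commit SHA or an object name
    excerpt: str  # the raw text the claim relies on


@dataclass(frozen=True)
class Claim:
    text: str
    evidence: Tuple[Evidence, ...]

    def __post_init__(self) -> None:
        # The constraint lives in the type: a claim without evidence
        # is rejected at construction time, not filtered out later.
        if not self.evidence:
            raise ValueError(f"uncited claim rejected: {self.text!r}")


def render(claims: Iterable[Claim]) -> str:
    """Format each claim followed by its citations in brackets."""
    lines = []
    for claim in claims:
        cites = ", ".join(f"{e.kind.value}:{e.ref}" for e in claim.evidence)
        lines.append(f"{claim.text} [{cites}]")
    return "\n".join(lines)
```

Under this assumption, an agent that produces an assertion without a log line, commit, or cluster object behind it fails loudly instead of emitting an unsupported sentence. Checking that the cited source actually exists would be a separate step, not shown here.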