Session

Everything, Everywhere, All at Once

On March 8, 2023, Datadog experienced a massive global outage that took almost 24 hours to mitigate, plus a further ~24 hours to backfill data after full app availability was restored. We’ll share what triggered the incident and why recovery was such a massive effort. We’ll review the technical details: why and how we lost more than 60% of our Kubernetes nodes in less than an hour, and the challenges we faced in recovering the tens of thousands of impacted nodes across hundreds of clusters. This was a very tough day for us, and we will share the hard-won technical and community lessons.

Hemanth Malla

Senior Software Engineer, Datadog

New York City, New York, United States


