When Logging Becomes The Outage: Escaping the ECS Logging Trap

Logging systems are designed to help us understand failures, but what happens when logging itself becomes the cause of an outage?

In this session, I will walk through a real-world incident involving Amazon ECS where the default logging configuration used blocking mode with CloudWatch as the log destination. When CloudWatch experienced an outage, application containers continued attempting to push logs while buffering them locally. As the logging buffer reached its limit, containers became blocked waiting for the logging driver, ultimately impacting application availability.
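The failure mode described above can be illustrated with a toy model. The following is a minimal sketch, not the actual ECS or Docker logging driver: it assumes a bounded in-memory buffer that nothing drains (standing in for a log destination that is down), and the names `write_log` and `simulate` are hypothetical. A real blocking driver would wait indefinitely; the short timeout exists only so the demonstration terminates.

```python
import queue


def write_log(buffer, line, mode, stall_timeout=0.05):
    """Try to buffer one log line; return 'buffered', 'dropped', or 'stalled'."""
    try:
        if mode == "non-blocking":
            # Non-blocking: never wait on the buffer.
            buffer.put_nowait(line)
        else:
            # Blocking: the writer waits for free space. The timeout is an
            # assumption of this sketch so the demo finishes.
            buffer.put(line, block=True, timeout=stall_timeout)
        return "buffered"
    except queue.Full:
        # Full buffer: non-blocking mode loses the line, blocking mode
        # holds up the caller (the application).
        return "dropped" if mode == "non-blocking" else "stalled"


def simulate(mode, buffer_size=3, lines=5):
    """Write `lines` log lines into a buffer that is never drained."""
    buffer = queue.Queue(maxsize=buffer_size)
    return [write_log(buffer, f"line {i}", mode) for i in range(lines)]


if __name__ == "__main__":
    print("blocking:    ", simulate("blocking"))
    print("non-blocking:", simulate("non-blocking"))
```

In this model, blocking mode stalls the writer once the buffer is full, while non-blocking mode keeps the writer running at the cost of lost log lines.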

This talk explores how a seemingly harmless default configuration can create an unexpected reliability risk in distributed systems. We will look at how ECS logging works under the hood, why blocking mode can create cascading failures during downstream outages, and how switching to non-blocking mode can isolate application workloads from logging system failures.
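As a rough illustration of the non-blocking setting, the sketch below builds the `logConfiguration` fragment of an ECS container definition using the `awslogs` driver's `mode` and `max-buffer-size` options. The log group name, region, stream prefix, and the 25m buffer size are placeholder assumptions, not values taken from the incident.

```python
import json

# Hypothetical logConfiguration fragment for an ECS container definition.
# Group, region, prefix, and buffer size are illustrative placeholders.
log_configuration = {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "/ecs/example-service",  # assumed log group name
        "awslogs-region": "us-east-1",            # assumed region
        "awslogs-stream-prefix": "app",           # assumed stream prefix
        "mode": "non-blocking",                   # do not block the container on log delivery
        "max-buffer-size": "25m",                 # in-memory buffer used in non-blocking mode
    },
}

if __name__ == "__main__":
    print(json.dumps(log_configuration, indent=2))
```

The trade-off is that in non-blocking mode log lines can be lost once the buffer fills, so the buffer size should be chosen with the expected log volume in mind.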

I will share the investigation process, architectural decisions, and practical lessons learned from redesigning the logging strategy to prevent observability dependencies from affecting production workloads.

Attendees will leave with actionable guidance on designing resilient logging pipelines and avoiding a class of failures in which logging infrastructure unintentionally becomes a single point of failure.

Rahul Tanniru

Senior Vice President of Software Engineering, JPMorgan Chase

Dallas, Texas, United States
