Rahul Tanniru

Senior Vice President of Software Engineering, JPMorgan Chase

Dallas, Texas, United States

I am a Senior Vice President of Software Engineering at JPMorgan Chase, specializing in enterprise-scale cloud platforms, distributed systems, and AI-enabled infrastructure. I lead and contribute to the design of secure, scalable, and highly resilient cloud-native architectures supporting mission-critical financial systems.

My expertise spans AWS, Kubernetes, platform engineering, DevOps, and Site Reliability Engineering, with a strong focus on building self-service infrastructure platforms and enabling large engineering organizations to operate at scale. I work extensively on multi-account cloud architectures, infrastructure automation, reliability engineering, and integrating AI and machine learning workloads into modern cloud environments.

Over the past decade, I have focused on designing high-performance systems that improve scalability, operational efficiency, and platform reliability in complex enterprise environments. I am passionate about advancing cloud architecture standards, mentoring engineers, evaluating innovative technology solutions, and contributing to the broader technology community through speaking, technical publications, and industry judging initiatives.

Area of Expertise

Finance & Banking
Information & Communications Technology

Topics

AWS
Containers
DevOps & Automation
SRE
EKS
ECS
Software Architecture

When Logging Becomes The Outage: Escaping the ECS Logging Trap

Logging systems are typically treated as passive observers of production workloads. But in containerized environments, logging pipelines can quietly become a critical dependency. In one of our large ECS environments, a downstream logging disruption created a cascading reliability risk. Because workloads were configured with blocking log drivers, application containers began stalling once the log buffers filled, effectively turning the observability pipeline into a potential outage trigger.

This session walks through the real reliability problem, the investigation process, and the architectural changes that followed. We’ll explore how blocking logging modes interact with downstream failures, why this configuration can introduce hidden reliability risks, and how switching to non-blocking logging changes system behavior during logging outages.

The talk will cover practical strategies for building resilient logging pipelines in ECS environments, including buffer management, failure isolation, and protecting application workloads from observability dependencies. Attendees will walk away with a better understanding of how to design logging architectures that support reliability rather than accidentally becoming a source of downtime.
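As a rough illustration of the difference between the two modes, the following Python sketch models a bounded log buffer whose downstream consumer has stopped draining it. It is a toy model, not the ECS or Docker log driver: the function names, the buffer capacity, and the short timeout (used only so the demo finishes) are all assumptions made for this example.

```python
import queue


def emit(buffer: queue.Queue, line: str, blocking: bool, timeout: float = 0.05) -> str:
    """Hand one log line to the buffer and report what the application experienced."""
    try:
        if blocking:
            # Blocking mode: the writer waits for free space. With a dead
            # consumer it would wait indefinitely; the timeout only keeps
            # this demo finite.
            buffer.put(line, timeout=timeout)
        else:
            # Non-blocking mode: never wait; a full buffer raises immediately.
            buffer.put_nowait(line)
        return "buffered"
    except queue.Full:
        return "stalled" if blocking else "dropped"


def simulate(blocking: bool, capacity: int = 3, lines: int = 5) -> list:
    """Write `lines` log lines into a buffer that nothing is draining."""
    buffer = queue.Queue(maxsize=capacity)
    return [emit(buffer, f"line {i}", blocking) for i in range(lines)]


blocking_result = simulate(blocking=True)
non_blocking_result = simulate(blocking=False)
print("blocking:    ", blocking_result)
print("non-blocking:", non_blocking_result)
```

In this model the blocking writer stalls once the buffer is full, while the non-blocking writer keeps running and loses log lines instead. That is the trade-off in miniature: availability of the application versus completeness of the logs.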

When Logging Becomes The Outage: Escaping the ECS Logging Trap

Logging systems are designed to help us understand failures, but what happens when logging itself becomes the cause of an outage?

In this session, I will walk through a real-world incident involving Amazon ECS where the default logging configuration used blocking mode with CloudWatch as the log destination. When CloudWatch experienced an outage, application containers continued attempting to push logs while buffering them locally. As the logging buffer reached its limit, containers became blocked waiting for the logging driver, ultimately impacting application availability.

This talk explores how a seemingly harmless default configuration can create an unexpected reliability risk in distributed systems. We will look at how ECS logging works under the hood, why blocking mode can create cascading failures during downstream outages, and how switching to non-blocking mode can isolate application workloads from logging system failures.
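For readers who want to see what the switch might look like, here is a minimal Python sketch that patches a container definition fragment to use non-blocking delivery. The `mode` and `max-buffer-size` keys are the log driver options used with the `awslogs` driver; the container name, log group, region, stream prefix, the `25m` buffer size, and the helper function itself are hypothetical values chosen for illustration, not the configuration from the incident.

```python
import copy
import json

# Hypothetical container definition fragment; every value here is a placeholder.
container_def = {
    "name": "example-app",
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "/ecs/example-app",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "app",
        },
    },
}


def make_non_blocking(definition: dict, max_buffer_size: str = "25m") -> dict:
    """Return a copy of the definition with non-blocking log delivery enabled."""
    updated = copy.deepcopy(definition)
    options = updated["logConfiguration"].setdefault("options", {})
    # Do not stall the container when the log destination is unavailable.
    options["mode"] = "non-blocking"
    # Size of the in-memory buffer; lines are dropped once it fills up.
    options["max-buffer-size"] = max_buffer_size
    return updated


patched = make_non_blocking(container_def)
print(json.dumps(patched["logConfiguration"], indent=2))
```

The buffer size is a judgment call: a larger buffer rides out longer disruptions at the cost of container memory, and anything beyond it is lost rather than delayed.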

I will share the investigation process, architectural decisions, and practical lessons learned from redesigning the logging strategy to prevent observability dependencies from affecting production workloads.

Attendees will leave with actionable guidance on designing resilient logging pipelines and avoiding a class of failures where logging infrastructure unintentionally becomes a single point of failure.

Kubernetes Is Expensive, Until It Isn’t: Lessons from Optimizing EKS at Scale

Kubernetes makes it incredibly easy to scale applications, but it also makes it very easy to overspend on infrastructure. Many clusters end up running with underutilized nodes, oversized pod resource requests, and environments that stay online even when no workloads actually need them.

In this session, I’ll share lessons from optimizing Amazon EKS clusters in a real production environment. After analyzing node utilization, pod resource usage, and workload patterns, we discovered that a large portion of compute capacity across clusters was sitting idle or being used inefficiently.

To address this, we introduced several improvements across the platform. We implemented dynamic node provisioning using Karpenter to launch nodes based on real workload demand and used a mix of On-Demand and Spot instances to reduce infrastructure costs while maintaining reliability. We also improved workload scaling using Horizontal Pod Autoscaler and introduced Vertical Pod Autoscaler to better right-size pod resource requests based on real usage patterns.
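To make the scaling behavior concrete, the sketch below reproduces the replica calculation documented for the Horizontal Pod Autoscaler: the desired count is the current count multiplied by the ratio of observed to target metric, rounded up, with no change when the ratio is within a tolerance of 1.0. The function name, the replica bounds, and the sample numbers are assumptions for this example, and the real controller applies further rules (stabilization windows, pod readiness) that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA calculation: ceil(current * observed / target), clamped."""
    ratio = current_metric / target_metric
    # Skip scaling when the observed metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical CPU utilization readings against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
print(desired_replicas(4, current_metric=62, target_metric=60))
print(desired_replicas(4, current_metric=20, target_metric=60))
```

Because utilization is measured relative to pod resource requests, oversized requests make workloads look idle and distort this ratio, which is one reason right-sizing requests and horizontal scaling are usually tuned together.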

Beyond cluster architecture changes, we also implemented operational improvements such as lightswitch scheduling, where non-production environments automatically scale down or shut off during nights and weekends and start again during working hours. This simple practice alone eliminated a surprising amount of unnecessary compute usage that many teams overlook.
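A lightswitch schedule can be reduced to a small decision function. The Python sketch below assumes a Monday to Friday window of 07:00 to 19:00 local time; the window, the function names, and the dates are illustrative assumptions, not the schedule used on the platform described here.

```python
from datetime import datetime


def should_be_running(now: datetime, start_hour: int = 7, end_hour: int = 19) -> bool:
    """Return True when a non-production environment should be switched on."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_working_hours = start_hour <= now.hour < end_hour
    return is_weekday and in_working_hours


def weekly_off_fraction(start_hour: int = 7, end_hour: int = 19) -> float:
    """Fraction of the week the environment is switched off under this schedule."""
    on_hours = 5 * (end_hour - start_hour)
    return 1 - on_hours / (7 * 24)


print(should_be_running(datetime(2026, 5, 6, 10, 0)))   # Wednesday morning
print(should_be_running(datetime(2026, 5, 6, 22, 0)))   # Wednesday night
print(should_be_running(datetime(2026, 5, 9, 10, 0)))   # Saturday morning
print(f"off for {weekly_off_fraction():.0%} of the week")
```

Under these assumed hours the environment is on for 60 of the 168 hours in a week, so it is off for roughly 64% of the time. The actual saving depends on how much of the environment's cost is compute that can be scaled to zero.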

This talk will walk through the real challenges we encountered, the architectural and operational decisions we made, and the practical lessons learned from running Kubernetes clusters more efficiently. The goal is to share approaches that platform and DevOps teams can apply to reduce Kubernetes costs without sacrificing reliability or performance.

DevOps Days Dallas 2026 (upcoming)

September 2026 Dallas, Texas, United States

CommunityDays KC 2026

May 2026 Overland Park, Kansas, United States
