Venkata Srinivas Kantamneni
Richmond, Virginia, United States
Topics
Why Predicting Cloud Failures Before They Happen Is the Only Strategy That Scales
Most reliability engineering today is built around detection and response. Something breaks, an alert fires, an engineer wakes up, and the race to recover begins. The tooling is sophisticated. The runbooks are detailed. The post-mortems are thorough. And none of it changes the fundamental fact that by the time your monitoring system tells you something is wrong, something is already wrong.
From Fragile to Fault-Tolerant: Coding Strategies for Maximum Application Stability
This topic covers the essential practices and patterns developers must adopt to embed resilience (the ability to recover from failure) and reliability (the probability of operating without failure) directly into the application code during the development phase.
It goes beyond infrastructure-level solutions to focus on writing code that anticipates and gracefully handles errors, resource constraints, and dependency failures.
Key Areas of Focus:
Error Handling and Retries: Implementing robust mechanisms like exponential backoff and jitter for retrying failed operations against external services or databases.
Circuit Breaker Pattern: Designing components that can detect when a dependency is down or slow and automatically "trip" the circuit to prevent cascading failures, allowing the dependency time to recover.
Bulkhead Pattern: Isolating different parts of the application or workload into separate pools of resources (like threads or connections) so that a failure in one area cannot exhaust resources needed by another, ensuring fault isolation.
Timeouts and Deadlines: Properly configuring connection and operation timeouts to ensure threads are not indefinitely held hostage by slow or unresponsive dependencies.
Idempotency: Designing APIs and operations to be idempotent, so that calling them multiple times has the same effect as calling them once (critical for safe retries).
Defensive Coding: Techniques like input validation, fail-fast principles, and ensuring all critical paths have clearly defined error states and recovery options.
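The first focus area above, retries with exponential backoff and jitter, can be sketched in a few lines of Python. This is an illustrative helper, not code from the session; the function name, defaults, and injectable `sleep` parameter are assumptions:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Retry `operation`, sleeping with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the last failure
            # Full jitter: uniform delay in [0, min(max_delay, base * 2^attempt)],
            # which spreads simultaneous retries and avoids thundering herds.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The "full jitter" variant randomizes the entire backoff window rather than adding a small offset, which tends to desynchronize clients retrying against the same recovering dependency.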
The goal is to provide developers with concrete code patterns and architectural best practices to significantly reduce unexpected downtime and enhance the user experience.
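As one illustration of such a pattern, the circuit breaker from the list above can be reduced to a small state machine: count consecutive failures, trip open at a threshold, fail fast while open, and allow a trial call after a cool-down. This Python sketch is an assumption-laden minimal version (class name, thresholds, and the injectable clock are illustrative):

```python
import time


class CircuitBreaker:
    """Trips open after `failure_threshold` consecutive failures, then
    permits a single trial (half-open) call after `reset_timeout` seconds."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the circuit
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

Failing fast while the circuit is open is what stops a slow dependency from tying up caller threads and cascading the outage upstream.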
Reliability by Design: A Practical Guide to SLO Monitoring
This session will provide a clear, practical, and comprehensive look at Service Level Objectives (SLOs) and their crucial role in modern software development and operations. We'll start with the fundamentals: what are SLOs, and how do they differ from Service Level Indicators (SLIs) and Service Level Agreements (SLAs)?
You'll learn how to identify the most critical user journeys and define meaningful SLOs that align with business goals and user expectations. We'll cover practical techniques for monitoring these objectives, including how to select the right metrics and build effective dashboards and alerting systems.
The core of this session will focus on the importance of SLO monitoring. We'll explore how effective SLO monitoring can help teams:
Make data-driven decisions: Move beyond intuition and "gut feelings" to prioritize work based on real-world impact.
Improve system reliability: Proactively identify and address potential issues before they cause widespread outages.
Enhance team collaboration: Foster better communication between engineering, product, and business teams by providing a shared, objective view of system health.
By the end of this session, you'll have a solid understanding of how to implement a successful SLO monitoring strategy that not only improves the reliability of your services but also creates a more sustainable and collaborative engineering culture.
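As a concrete taste of the monitoring arithmetic, a request-based SLI and the fraction of error budget consumed can be computed directly from event counts. The sketch below assumes a simple good-events/total-events model; the function and field names are illustrative, not from the session:

```python
def error_budget_report(good_events, total_events, slo=0.999):
    """Compute the SLI and the fraction of the error budget consumed
    over a window, from counts of good and total events."""
    sli = good_events / total_events
    allowed_bad = (1 - slo) * total_events      # bad events the budget permits
    actual_bad = total_events - good_events
    consumed = actual_bad / allowed_bad if allowed_bad else float("inf")
    return {"sli": sli, "budget_consumed": consumed}
```

For example, 999,500 good requests out of 1,000,000 against a 99.9% SLO yields an SLI of 0.9995 and half the error budget consumed, which is the kind of number a burn-rate alert would fire on.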
Automated Regional Failover
How to achieve automated failover to another region during disaster recovery or a regional service outage.
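One common way to implement this on AWS is DNS failover routing. The fragment below is a hypothetical Amazon Route 53 change batch pairing a health-checked PRIMARY record in one region with a SECONDARY standby in another; the domain, IP addresses, and health-check ID are placeholders:

```json
{
  "Comment": "Failover routing: health-checked primary region, standby secondary",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "primary-us-east-1",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "<primary-health-check-id>",
        "ResourceRecords": [{ "Value": "203.0.113.10" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "secondary-us-west-2",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "198.51.100.20" }]
      }
    }
  ]
}
```

When the primary's health check fails, Route 53 automatically answers with the secondary record; the low TTL bounds how long clients keep resolving to the failed region.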
Automating Chaos: Scalable Regional Isolation Testing on AWS for Modern Resilience
This session offers a behind-the-scenes look at how a leading digital enterprise scaled chaos engineering on AWS to validate and strengthen regional resilience. Discover how regional isolation testing evolved from manual experimentation into a fully automated, repeatable practice that’s embedded into the fabric of production operations.
Attendees will gain practical insights into:
• Designing an automated regional isolation framework across availability zones and regions
• Integrating chaos triggers with observability tools like New Relic and Splunk
• Embedding failover validation into CI/CD pipelines using deployment guardrails
• Leveraging chaos engineering to drive resilient architecture decisions
• Moving from isolated experiments to enterprise-wide reliability practices
Whether you’re an SRE, cloud engineer, or technology leader, this session will help you move beyond chaos theory and into a proactive resilience strategy—one that ensures systems degrade gracefully and recover predictably at scale.
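To picture the chaos-trigger idea in miniature: wrap a dependency call so that a controlled fraction of invocations fail, then verify the system degrades gracefully. This Python sketch is illustrative only; the practice described above runs against real AWS infrastructure with observability integration, not an in-process wrapper:

```python
import random


def chaos_wrap(operation, failure_rate, rng=random.random):
    """Wrap `operation` so a controlled fraction of calls raise a fault,
    simulating a degraded dependency during a chaos experiment."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("chaos: injected dependency failure")
        return operation(*args, **kwargs)
    return wrapped
```

The injectable `rng` makes experiments reproducible in tests; in production-grade frameworks the equivalent knob is the experiment's blast-radius configuration.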
Architecting the Error Budget: How Senior Leadership Teams Use SRE Metrics to Balance Innovation and Risk
You will learn how the Error Budget is the financial and operational expression of resilience, and how you can use it to make data-driven decisions on reliability, feature velocity, and risk tolerance across your product portfolio.
Key Takeaways for Leaders:
The Cost of Perfection: Understand why chasing 100% availability is economically irrational and how to define a Maximum Tolerable Downtime (MTD) that aligns with customer willingness to pay and competitive market standards.
The Error Budget as Capital: Learn to view the Error Budget as the shared currency between Development (velocity) and Operations (stability). We will cover how to govern this budget to enforce a disciplined trade-off between shipping new features and improving resilience.
SLOs for Business Alignment: Discover how to establish user-centric Service Level Objectives (SLOs) that directly map to your organization's business Key Performance Indicators (KPIs), such as customer retention, conversion rates, and revenue.
Driving Proactive Investment: Use Error Budget consumption metrics to proactively justify and prioritize investment in reliability work (like automation, testing, and Chaos Engineering) before system failures force a costly, reactive halt to feature development.
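The arithmetic behind "The Cost of Perfection" is easy to make concrete: over a fixed window, an availability SLO translates directly into an allowed-downtime budget. A minimal sketch, assuming a 30-day window (the helper name is illustrative):

```python
def downtime_allowance_minutes(slo, window_days=30):
    """Minutes of full downtime an availability SLO permits per window."""
    return (1 - slo) * window_days * 24 * 60

# Each extra "nine" shrinks the budget roughly tenfold:
# 99.9% over 30 days allows about 43.2 minutes; 99.99% about 4.3 minutes.
```

Seen this way, the jump from three nines to four is not an incremental tuning exercise but a tenfold cut in tolerable downtime, which is exactly the trade-off leaders must price against feature velocity.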
This session will equip you with the managerial vocabulary and framework needed to lead SRE adoption and embed organizational resilience into your quarterly planning and resource allocation cycles.