Ishan Shah
PayPal, Software Engineer | Distributed Systems, AI, and Platform Engineering
San Francisco, California, United States
Ishan Shah is a Staff Software Engineer with 10+ years of experience designing and delivering large-scale, event-driven, and cloud-native systems across PayPal, Nordstrom, and Securonix. He has built real-time data platforms, CDC pipelines, inventory and decisioning systems, and high-throughput search and analytics infrastructure.
His work increasingly sits at the intersection of AI and engineering operations, including AI-assisted development workflows, context engineering, agent guardrails, and AI-native reliability patterns. Ishan is especially interested in practical approaches to autonomous software development, AI SRE, incident-response agents, and “AI gardening”: managing context, memory, and entropy in long-running agentic systems.
He mentors teams on reliability, observability, schema governance, and scalable platform design, with deep expertise in Java, Kafka, Debezium, AWS, Redis, Postgres, DynamoDB, and distributed systems.
AI Ops for Real Systems: Keeping Production AI Grounded with Fresh, Reliable Event Streams
One of the biggest failure modes in production AI is not the model, but the data pipeline feeding it. This talk shows how to design operational event streams that keep AI applications grounded in fresh, trustworthy state. Using lessons from large-scale distributed systems, I’ll cover freshness SLAs, lag detection tied to domain truth, replay/backfill safety, schema governance, and failure playbooks so downstream AI services don’t make decisions on stale or inconsistent data.
Partitioning with Purpose: Kafka Producer Strategies that Cut Lag ~30%
Hot partitions and uneven key distribution are common—and they quietly cap throughput. This session breaks down how purposeful partitioning (entity‑affinity keys, composite keys, and targeted salting) plus tuned producer configs (batch.size, linger.ms, acks, idempotent producers) reduced consumer lag by ~30% in a high‑throughput pipeline. We’ll align partition strategy with consumer concurrency, discuss sticky vs uniform distribution, and show practical skew detection (per‑partition lag, heatmaps) and mitigation (pre‑partitioning, re‑keying). You’ll leave with a field‑tested checklist to diagnose hotspots, tune batching, and roll out changes safely without thundering herd rebalances.
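To make "targeted salting" concrete, here is a minimal sketch, not the talk's actual implementation. It assumes a known set of hot entity IDs and spreads only those across a fixed number of salt buckets, leaving all other keys untouched. The names salted_key, partition_for, hot_keys, and salt_buckets are hypothetical, and SHA-256 stands in for Kafka's real default partitioner hash (murmur2).

```python
import hashlib
from typing import Set


def salted_key(entity_id: str, event_id: str, hot_keys: Set[str],
               salt_buckets: int = 8) -> str:
    """Return the record key to produce with (hypothetical helper).

    Cold entities keep their plain key, so per-entity ordering is preserved.
    Hot entities get a deterministic bucket suffix derived from the event ID,
    which spreads their traffic over up to `salt_buckets` distinct keys.
    """
    if entity_id not in hot_keys:
        return entity_id
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % salt_buckets
    return f"{entity_id}#{bucket}"


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in for a hash partitioner (assumption: Kafka itself uses murmur2)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


if __name__ == "__main__":
    hot = {"sku-1"}
    for i in range(5):
        key = salted_key("sku-1", f"event-{i}", hot)
        print(key, "-> partition", partition_for(key, 12))
```

The trade-off is that a salted entity no longer has a single ordered stream: consumers that need per-entity ordering for hot keys have to merge or re-sequence across the buckets, which is one reason to salt only the keys that are measurably hot.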
CDC in the Real World: Debezium + Kafka Streams for Trustworthy Inventory
Inventory and catalog drift hurts availability, promise accuracy, and customer trust. This talk shares a production‑proven CDC blueprint using Debezium → Kafka (Avro + Schema Registry) → Kafka Streams state stores to publish Postgres RDS changes, preserve per‑entity ordering, and keep availability accurate in near real time. We’ll cover business‑keyed partitions, compaction, idempotency, handling late/out‑of‑order events, and safe replays/backfills without double‑counting. You’ll see the operational playbook (lag SLOs tied to domain truth, DLQ triage, and runbooks for connector restarts/rebalances) plus governance practices for topic conventions and schema evolution. We’ll close with outcomes from deploying unified outbound inventory events and maintaining bounded lag under bursty load.
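One way to picture "safe replays/backfills without double-counting" is a last-writer-wins apply guarded by a per-entity sequence number. The sketch below is an illustration under stated assumptions, not the production design: it assumes each change event carries the absolute on-hand quantity (not a delta) and a monotonically increasing sequence per SKU (for example, one derived from the source database's log position). A plain dict stands in for a Kafka Streams state store, and InventoryChange and apply_change are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class InventoryChange:
    sku: str
    seq: int      # assumed to increase monotonically per SKU at the source
    on_hand: int  # absolute quantity after the change, not a delta


def apply_change(store: Dict[str, Tuple[int, int]],
                 change: InventoryChange) -> bool:
    """Apply a change only if it is newer than what the store already holds.

    Returns True when the store was updated, False when the event was a
    duplicate, a replay, or a late arrival that has already been superseded.
    """
    last = store.get(change.sku)
    if last is not None and change.seq <= last[0]:
        return False
    store[change.sku] = (change.seq, change.on_hand)
    return True


if __name__ == "__main__":
    store: Dict[str, Tuple[int, int]] = {}
    events = [
        InventoryChange("sku-1", 1, 10),
        InventoryChange("sku-1", 3, 7),
        InventoryChange("sku-1", 2, 9),  # late, already superseded
        InventoryChange("sku-1", 3, 7),  # replayed duplicate
    ]
    for event in events:
        print(event, "applied" if apply_change(store, event) else "skipped")
```

Because events carry state rather than deltas and the guard is a simple comparison, replaying the same range of the topic any number of times leaves the store unchanged, which is what makes backfills safe in this simplified model.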
Speed vs. Cost in Fulfillment: Optimizing What Actually Matters
“Closest FC ≠ fastest or cheapest.” This talk walks through a production-grade optimization that selects fulfillment centers by true speed and true cost—combining labor models, carrier rate cards, and SLA penalties with solver-backed routing. We’ll cover inputs (zones, weights, surcharges), constraints (eligibility, capacity), experimentation (backtests + traffic splits), and rollout guardrails. Expect real-world pitfalls, what moved the needle, and how we measured success.
Schema Governance That Scales: Contracts, Versioning, and Safe Evolution
The fastest way to break a streaming platform is casual schema changes. This talk shares a governance model: namespacing, compatibility rules, deprecation playbooks, linting, CI gates, and topic lifecycle policies. We’ll cover consumer-driven contracts, backfills, and rollout sequencing that avoids “flag days.”
Reverse Logistics as a Platform: From RMA to Restock in Hours
Returns are data-heavy and time-sensitive. We’ll design an event-driven reverse-logistics pipeline that connects carriers, FCs, QC, and catalog availability. Topics include as-received vs. as-inspected states, partial credits, fraud controls, and rapid resale. Expect architecture diagrams and the KPIs that changed the business.
Experimentation for Platforms: A/B Testing Your Supply Chain & Routing
Feature flags aren’t enough when the “feature” is a new optimizer or routing heuristic. Learn how to run fair, explainable experiments on fulfillment and logistics: backtests vs. live splits, guardrails, unit-economics attribution, and communicating wins (or honest losses) to non-technical stakeholders.
AI Gardening: Managing Entropy in Agentic Engineering Systems
As AI workflows grow, so does entropy: context bloat, stale memory, conflicting instructions, noisy tools, and degraded outputs over time. This talk introduces AI Gardening as a practical operating model for keeping agentic systems healthy: pruning context, shaping memory, isolating tasks, designing reset boundaries, and controlling prompt-state sprawl before it turns into drift and unreliability.
What would be covered:
• why agent quality drops as context accumulates
• prompt-state sprawl and hidden coupling
• pruning, compaction, reset, and scoped-memory patterns
• context engineering vs prompt hacking
• how to keep long-running AI workflows maintainable
From Prompting to Production: Guardrails for Autonomous Software Development Agents
Many teams can demo coding agents. Far fewer can make them safe enough for real engineering use. This session walks through a production-minded blueprint for autonomous software-development agents that can move from issue understanding to implementation, testing, and PR creation using context engineering, harness-based execution, and guardrails.
What would be covered:
• task decomposition and bounded execution
• context packaging for repos, tickets, logs, and design docs
• harness-based validation in Dockerized test environments
• tool permissioning and approval boundaries
• generating reviewable PRs instead of unsafe direct changes
AI SRE: Building Incident-Response Agents That Start the RCA Before You Do
PagerDuty goes off. Before a human fully opens the laptop, an AI SRE agent can already be pulling telemetry, checking dashboards, correlating logs, and drafting an incident summary. This talk shows how to design an AI incident-response workflow that integrates with tools like PagerDuty, Datadog, and New Relic to accelerate triage without bypassing safety.
What would be covered:
• event trigger from alert to investigation
• gathering evidence from observability systems
• forming a first-pass RCA hypothesis
• drafting timelines and incident summaries
• keeping humans in the approval loop
From RCA to PR: Safe Remediation Workflows with Autonomous Agents
Finding the issue is only half the story. The next frontier is agents that can take incident context, propose a code or config fix, validate it in isolated environments, and prepare a PR for human review. This talk covers the architecture, control points, and failure modes of building AI remediation workflows that are useful without being reckless.
What would be covered:
• turning logs + RCA into executable change candidates
• sandboxed code/config modification
• automated test harnesses and regression gates
• rollback awareness and blast-radius controls
• PR-first workflows instead of auto-merge fantasies