Pratik Mahalle
DevRel
Pune, India
Hey, I'm Pratik, currently working in Developer Relations. I'm also an AWS Community Builder, and I love spending my time in the community.
Why Traces Didn’t Explain Our Search Latency, Until We Changed How We Used OpenTelemetry
Search performance issues are notoriously hard to debug. We assumed that adding OpenTelemetry tracing across our search pipeline would immediately make latency problems obvious. Instead, we ended up with detailed traces that explained very little.
In this session, I’ll share why our initial OpenTelemetry setup failed to help us debug search latency, even though we were “doing everything right.” The real issue wasn’t missing spans—it was missing context. We weren’t capturing the right attributes to explain shard fan-out, query complexity, cache behavior, or ranking stages.
I’ll walk through how we rethought instrumentation for search systems: what not to trace, which semantic attributes actually matter, and how to connect user-facing queries to backend execution paths. Using real examples, I’ll show how a small change in instrumentation exposed retry storms and cache churn that were invisible before—and how fixing those reduced tail latency significantly.
This talk is less about OpenTelemetry basics and more about learning how to use it effectively for complex, distributed search workloads.
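To make the "missing context" point concrete, here is a minimal sketch of the kind of attribute enrichment the talk describes, using the OpenTelemetry Python API. The attribute names (search.shard.fanout, search.cache.hit, and so on) are illustrative placeholders rather than established semantic conventions, and the shard and cache objects are hypothetical.

```python
from opentelemetry import trace

tracer = trace.get_tracer("search-service")

def execute_search(query: str, shards, cache):
    # Wrap query execution in a span and attach the context that actually
    # explains latency: query shape, shard fan-out, and cache behavior.
    with tracer.start_as_current_span("search.execute") as span:
        span.set_attribute("search.query.term_count", len(query.split()))
        span.set_attribute("search.shard.fanout", len(shards))

        cached = cache.get(query)  # hypothetical cache interface
        span.set_attribute("search.cache.hit", cached is not None)
        if cached is not None:
            return cached

        results = [shard.search(query) for shard in shards]  # hypothetical shard API
        span.set_attribute("search.result.count", sum(len(r) for r in results))
        return results
```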
Zero-Code Observability: Using eBPF to Monitor Kubernetes
Traditional observability approaches require extensive application instrumentation, adding complexity and maintenance overhead to your codebase. What if you could gain deep insights into your Kubernetes workloads without modifying a single line of application code?
This talk explores production-grade eBPF-based observability, demonstrating how kernel-level instrumentation provides comprehensive visibility into service behavior, network traffic, and system performance. Using tools like Cilium Hubble and Pixie, we'll show how eBPF enables automatic discovery of service dependencies, real-time network analysis, and low-overhead profiling.
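Hubble and Pixie handle this at production scale, but the underlying idea is easy to sketch. The following is a minimal, illustrative kernel probe using the bcc Python bindings (it assumes a Linux host with bcc installed and root privileges); it is not how Hubble or Pixie are implemented, only a small demonstration of instrumenting the kernel instead of the application.

```python
from bcc import BPF  # requires the bcc toolkit and root privileges

# Attach a kprobe to the openat syscall and print a trace line each time
# any process on the host opens a file -- no application changes needed.
program = r"""
int trace_openat(void *ctx) {
    bpf_trace_printk("openat called\n");
    return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_openat")
b.trace_print()  # stream events from the kernel trace pipe
```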
Why Our First Self-Service Platform Failed and How We Rebuilt It
Platform engineering promises to improve developer productivity, but many platforms fail due to poor developer experience. Our first platform had comprehensive features but near-zero adoption. Developers continued using manual processes despite our investments in automation.
This session shares our journey rebuilding the platform with developers as partners, not customers. We'll reveal the critical mistakes in our first attempt: assuming what developers needed without asking, building complex abstractions that leaked implementation details, and creating cognitive overhead that exceeded the value provided.
The talk demonstrates our second approach: conducting developer experience audits to identify friction points, implementing progressive disclosure so basic workflows remain simple while advanced capabilities stay accessible, and using contract testing to prevent platform changes from breaking developer workflows.
When Contract Testing Slows Teams Down: What Microcks Solved and What It Didn’t
Contract testing promises to catch breaking changes early. In reality, our first attempt at enforcing contract testing created more problems than it solved: failing builds, frustrated developers, and teams bypassing the system entirely.
In this talk, I’ll share a real-world journey of adopting Microcks for API mocking and contract testing, including the mistakes we made early on. I’ll walk through how strict contract enforcement backfired, why developers resisted stateful mocks, and where automated validation became noise instead of signal. More importantly, I’ll explain how we adjusted our workflow to make Microcks genuinely useful: introducing progressive enforcement, scoping contract checks to high-risk APIs, and using recorded interactions instead of hand-written mocks.
The session focuses on practical tradeoffs rather than ideal setups. I’ll show CI examples, policy changes, and team feedback that shaped the final approach. Attendees will leave with a realistic framework for deciding where contract testing adds value and where it simply slows teams down.
Tracing the "Thought Process": Observability for AI Agents via MCP and OpenTelemetry
As AI agents evolve from simple chatbots into complex orchestrators using the Model Context Protocol (MCP), they have become the ultimate distributed system "black boxes." When an agent fails to complete a task or enters an infinite reasoning loop, traditional APM metrics like latency and CPU usage offer no clues. We need to see the "why" behind the agent's decisions.
In this technical deep-dive, we demonstrate how to bridge the visibility gap by integrating OpenTelemetry with the MCP ecosystem. We will explore how to implement W3C Trace Context propagation across MCP clients (like Claude or custom agents) and MCP servers (tool providers). Using the latest OpenTelemetry GenAI Semantic Conventions, we’ll show how to capture critical agentic signals: reasoning steps, tool-calling intent, and token consumption.
The session features a live demonstration of a Python-based AI agent calling a TypeScript MCP tool server, linked by a single, unified OTel trace. Attendees will learn how to transform opaque AI "thoughts" into actionable, structured telemetry that can be debugged, audited, and scaled.
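As a rough sketch of the propagation step (not the full demo), the client-side half might look like the following in Python. The opentelemetry.propagate.inject call and the MCP ClientSession.call_tool method are real APIs; passing the carrier alongside the tool arguments and the mcp.tool.name attribute are assumptions made for illustration, and a production setup would follow the GenAI and MCP semantic conventions as they stabilize.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("mcp-client")

async def traced_tool_call(session, tool_name: str, arguments: dict):
    # One client-side span per tool call, with W3C trace context injected
    # into a carrier dict so the tool server can continue the same trace.
    with tracer.start_as_current_span(f"mcp.tool {tool_name}") as span:
        span.set_attribute("mcp.tool.name", tool_name)  # illustrative attribute

        carrier: dict[str, str] = {}
        inject(carrier)  # writes "traceparent" (and "tracestate" when present)

        # Sending the carrier alongside the arguments is an assumption here;
        # the server side would extract() it and start a child span.
        return await session.call_tool(tool_name, {**arguments, "_trace": carrier})
```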
Taming GPU Chaos: Practical Kubernetes Policies for ML Workloads
Your Kubernetes cluster was humming along fine until the ML team showed up. Suddenly one training job is eating every GPU in the cluster, inference pods are getting OOMKilled, and nobody knows which team's model is costing what. This talk focuses on one specific, painful problem: how platform teams can enforce fair, safe resource governance for ML workloads on shared Kubernetes infrastructure without becoming bottlenecks.
Using Kyverno as the policy engine, we'll walk through real-world patterns including automatic resource quota enforcement for GPU requests so a single runaway training job can't starve production, namespace-level guardrails that give data scientists self-service deployment within safe boundaries, and labeling and annotation policies that make cost attribution and chargeback actually possible.
You'll walk away with a simple ML-readiness policy bundle you can apply to your cluster on Monday. No giant platform diagrams. No "just build an internal developer platform" hand-waving. Just practical policy-as-code patterns that solve the most common GPU scheduling headaches.
Open Source Isn’t Just Code
Open source is often introduced as “just pick an issue and submit a PR,” but for newcomers this advice is vague and intimidating. As a result, many first-time contributors either give up early or submit low-impact contributions that don’t lead to long-term involvement.
In this talk, I’ll share a practical, experience-based introduction to open source contribution that goes beyond code. I’ll walk through real entry points such as documentation fixes, issue triaging, community support, testing, and improving developer experience—roles that are critical to project health but rarely explained clearly.
I’ll also cover how to choose the right project, understand maintainer expectations, communicate effectively on issues and pull requests, and avoid common beginner mistakes that silently block contributions. The session includes examples from real open source projects, showing how contributors grew from their first interaction to sustained involvement.
This talk is aimed at students, early-career engineers, and anyone who wants to contribute to open source but doesn’t know how to start in a way that actually helps the project.
Observability as the Missing Layer for Sovereign AI Infrastructure
As organizations race to adopt AI, a critical blind spot is emerging: most AI workloads running on cloud-native infrastructure lack meaningful observability, and when sovereignty requirements enter the picture, this gap becomes a serious governance risk.
This talk explores how OpenTelemetry can serve as the open, vendor-neutral observability backbone for sovereign AI deployments. We'll walk through how to instrument AI agent pipelines, from LLM inference calls to multi-agent orchestration, using OTel semantic conventions, and why this matters when your data must stay within jurisdictional boundaries.
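As an illustration of that instrumentation, a single LLM inference call might be wrapped like this with the OpenTelemetry Python API. The gen_ai.* attribute names follow the current GenAI semantic conventions, which are still evolving, and the client object and its response shape are hypothetical.

```python
from opentelemetry import trace

tracer = trace.get_tracer("sovereign-ai-pipeline")

def traced_completion(client, model: str, prompt: str):
    # One span per inference call, annotated with GenAI semantic-convention
    # attributes so model choice and token usage stay auditable in-region.
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)

        response = client.complete(model=model, prompt=prompt)  # hypothetical client

        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response
```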
From 100 to 100,000 RPS: Scaling gRPC Services in Production Without Breaking the Bank
This talk presents a comprehensive journey through scaling gRPC services from startup-level traffic to enterprise-scale demands, drawing from real production experiences at companies processing millions of requests daily. Starting with a modest 100 RPS monolithic service, we'll explore the architectural decisions, performance optimizations, and cost management strategies that enabled scaling to 100,000+ RPS without proportional infrastructure cost increases.
The session covers critical scaling bottlenecks including connection pooling, load balancing strategies, circuit breakers, and resource optimization techniques specific to gRPC's HTTP/2 multiplexing capabilities. Through detailed performance benchmarks and cost analysis, this talk demonstrates how thoughtful gRPC implementation choices can achieve 10x traffic growth with only 3x infrastructure costs, making it essential viewing for engineering teams planning for scale.
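To give a flavor of the client-side knobs involved, here is a minimal Python sketch of a gRPC channel tuned for sustained load: a dns:/// target so the resolver returns all backend addresses, round-robin load balancing across them, and keepalives so idle HTTP/2 connections are not silently dropped. The target name and values are illustrative; retries and circuit breaking would typically be layered on via interceptors or a service mesh.

```python
import grpc

# Client-side channel configuration; every option below is a standard
# gRPC channel argument, with illustrative values.
channel = grpc.insecure_channel(
    "dns:///search-backend.internal:50051",  # hypothetical service name
    options=[
        ("grpc.lb_policy_name", "round_robin"),
        ("grpc.keepalive_time_ms", 30_000),     # ping every 30s when idle
        ("grpc.keepalive_timeout_ms", 10_000),  # fail if no ack within 10s
        ("grpc.max_receive_message_length", 4 * 1024 * 1024),
    ],
)
```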
Community-Centric gRPC: Building Learning Ecosystems Around Complex Tech
gRPC can be complex—high performance, yes, but also intimidating for newcomers. In this talk, I explore how we can make gRPC more accessible and inclusive by building community-driven learning ecosystems. Drawing from real-world community contributions, DevRel experiences, and open-source education efforts, I'll share strategies for documentation, live demos, mentorship, and platform advocacy that empower more developers to adopt and understand gRPC. Whether you’re a project maintainer, contributor, or passionate about developer experience, this session will give you practical insights into growing a healthy, scalable, and welcoming ecosystem around complex cloud-native tech like gRPC.
Bringing Vector Search to Production with OpenSearch
Vector search has moved well beyond experimentation—teams are now running semantic search and retrieval-augmented generation workloads in production. However, many implementations struggle once real traffic, cost constraints, and reliability requirements show up.
In this session, I'll walk through how to take a vector search setup from prototype to production using OpenSearch. I'll start with a simple architecture overview covering embedding generation, indexing strategies, and query flows, and then dive into the decisions that matter most in real systems: choosing between dense and sparse vectors, shard and replica planning, balancing recall against latency, and combining vector search with traditional keyword ranking using hybrid approaches.
I’ll also share how we evaluated performance using practical metrics, what we monitored in production, and how we handled reindexing and cost control as data grew. The talk includes a short demo and a checklist you can reuse to decide when approximate search is “good enough” and when exact search is worth the cost.
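For orientation, a minimal k-NN setup with the opensearch-py client might look like the sketch below (it assumes a local OpenSearch cluster with the k-NN plugin enabled; the index name, field names, and dimension are placeholders). Hybrid keyword-plus-vector ranking and the production concerns above sit on top of this.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Index with one knn_vector field backed by an HNSW graph (approximate search).
client.indices.create(
    index="docs",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,  # must match the embedding model's output size
                    "method": {"name": "hnsw", "engine": "lucene", "space_type": "l2"},
                },
            }
        },
    },
)

# Approximate nearest-neighbor query; the query vector would normally come
# from the same embedding model used at index time.
query_embedding = [0.1] * 384  # stand-in for a real query embedding
results = client.search(
    index="docs",
    body={
        "size": 10,
        "query": {"knn": {"embedding": {"vector": query_embedding, "k": 10}}},
    },
)
```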
Beyond the SDK: Contributing to the Future of OTel Semantic Conventions
Writing code is only half the battle in observability; the real challenge is ensuring that code speaks a universal language. As we move into 2026, the OpenTelemetry Semantic Conventions have expanded far beyond simple HTTP metrics into complex domains like GenAI, CI/CD pipelines, and Security. But how do these "standard names" actually get decided, and how can you influence them?
This session pulls back the curtain on the Semantic Conventions Special Interest Group (SIG). We will walk through the lifecycle of a convention: from identifying a gap in the OTel Registry to drafting a YAML-based proposal and navigating the stabilization process. We will also introduce OTel Weaver, the new CLI engine designed to automate the validation and generation of these conventions. Whether you are an end-user needing a custom domain registry or a developer wanting to contribute to the global standard, this talk provides a clear roadmap for moving "Beyond the SDK" and helping define the future grammar of observability.
Chaos Engineering for Security: Breaking Systems to Strengthen Defenses
We often hear about chaos engineering in the context of reliability, but what if we applied that same philosophy to security? In this session, I'll explore the emerging practice of Security Chaos Engineering, in which we intentionally inject failures and simulate attacks to uncover hidden security weaknesses before adversaries do.
Using open source tools like Chaos Mesh, LitmusChaos, and KubeArmor, I'll demonstrate how teams can proactively test assumptions about their security posture. From simulating pod compromise in Kubernetes to testing firewall rule effectiveness under duress, the session will walk through real-world scenarios where controlled chaos leads to deeper system hardening.
Rather than reacting to incidents, what if we could break things on purpose and make our systems safer?