Speaker

Rudraksh Karpe

Forward Deployed Engineer

Bengaluru, India

Rudraksh is a Forward Deployed Engineer (FDE) at Simplismart, where he builds solutions focused on high-performance AI inference. He previously worked as an AI Engineer at ZS Associates, was a two-time Google Summer of Code participant with the openSUSE Project, and was a Deep Learning Research Intern at the Indian Institute of Tropical Meteorology, working on climate downscaling. Rudraksh has presented internationally at events including OpenSearch Korea, PyCon US, PyCon Japan, the openSUSE Conference, the Early Adopter Tech Summit (Florida), and the openSUSE Asia Summit (Tokyo, Japan), focusing on GenAI, open source, and cloud-native technologies.

Area of Expertise

  • Information & Communications Technology

Topics

  • AI Agents
  • Generative AI
  • LLMs
  • Kubernetes
  • Edge
  • Edge AI

Supercharging OpenSearch Clusters with GPU-Accelerated Vector Search

Modern AI applications such as semantic search, RAG pipelines, and recommendation systems rely on large-scale vector search across millions to billions of embeddings. As datasets grow, CPU-only OpenSearch clusters struggle with slow vector indexing, rising query latency, and increasing infrastructure costs, making production-grade AI search difficult to operate reliably.

This talk explores how GPU-accelerated vector search transforms OpenSearch into a scalable platform for modern AI workloads. By offloading compute-intensive tasks such as vector index construction and similarity search from CPUs to GPUs, OpenSearch achieves faster indexing, lower query latency, and predictable performance at scale.

Attendees will learn how GPU acceleration can reduce index build times from hours to minutes, increase search throughput, and support large-scale embedding experimentation without impacting production stability.
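
For orientation, here is a small, hedged sketch of the kind of k-NN index and query this workload revolves around, using the opensearch-py client; the host, index name, dimension, and HNSW parameters are illustrative assumptions, and GPU-accelerated index builds are enabled at the cluster/plugin level, so the application-facing API stays essentially the same.

```python
from opensearchpy import OpenSearch

# Illustrative sketch: host, credentials, index name, dimension, and HNSW
# parameters are placeholder assumptions.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="docs-embeddings",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 768,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",   # ANN engine; GPU index builds are a cluster-level capability
                        "space_type": "l2",
                        "parameters": {"m": 16, "ef_construction": 128},
                    },
                }
            }
        },
    },
)

# Approximate nearest-neighbour query against the indexed embeddings.
results = client.search(
    index="docs-embeddings",
    body={"size": 10, "query": {"knn": {"embedding": {"vector": [0.1] * 768, "k": 10}}}},
)
```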

Recursive Language Models (RLMs): Scaling to Infinite Context via Programmatic Decomposition

This presentation introduces Recursive Language Models (RLMs), developed at MIT CSAIL. RLMs address context rot on long inputs by moving the full context outside the model into a Python REPL environment. The main LLM writes code to read, split, search, summarize, and recursively call smaller or cheaper instances of itself on smaller chunks. This keeps the root model’s context small while handling very large inputs.

RLMs work with any existing LLM and scale effectively to 10 million+ tokens without retraining or losing performance. On benchmarks like OOLONG and BrowseComp-Plus, RLMs clearly outperform standard frontier models and common long-context methods, often at similar or lower cost.

We show a simple PyTorch implementation and introduce RLM-Qwen3-8B, a post-trained model that learns native recursion and performs much better than its base on long-context tasks. RLMs offer a practical way to build agentic and deep-research systems today.
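
To make the pattern concrete, here is a minimal hedged sketch of the recursive decomposition idea (not the MIT CSAIL implementation): llm stands in for any chat-completion call, and the chunk size and prompts are illustrative assumptions.

```python
# Minimal sketch of the recursive decomposition behind RLMs.
# `llm` is a placeholder for any chat-completion call (OpenAI, vLLM, etc.);
# chunking strategy and prompts are illustrative assumptions, not the paper's code.

def llm(prompt: str) -> str:
    """Stub: replace with a real model call."""
    raise NotImplementedError

CHUNK_WORDS = 8_000  # keep each sub-call comfortably inside the model's window

def rlm_answer(question: str, context: str) -> str:
    # Base case: the context already fits, so answer directly.
    words = context.split()
    if len(words) <= CHUNK_WORDS:
        return llm(f"Context:\n{context}\n\nQuestion: {question}")

    # Recursive case: split the context, extract what matters from each chunk,
    # then recurse on the much smaller concatenated summaries.
    chunks = [" ".join(words[i:i + CHUNK_WORDS]) for i in range(0, len(words), CHUNK_WORDS)]
    partial_answers = [
        llm(f"Extract everything relevant to the question.\nQuestion: {question}\nText:\n{chunk}")
        for chunk in chunks
    ]
    return rlm_answer(question, "\n".join(partial_answers))
```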

Lightning-Fast Knowledge Graphs in Python: Real-Time Multi-Hop Reasoning with NVIDIA cuGraph

Imagine querying complex knowledge graphs in real time, right from Python, with all the performance of a GPU supercomputer and none of the usual code headaches. This session reveals how NVIDIA cuGraph turbocharges single-hop, multi-hop, and traversal operations on giant knowledge graphs, cutting response times from seconds to milliseconds. We’ll break down how cuGraph’s GPU-accelerated algorithms work seamlessly with popular Python tools and how you can combine cuGraph with deep learning frameworks like PyTorch for ultra-scalable AI and retrieval-augmented generation (RAG) pipelines. Join us for practical demos, hands-on advice, and approachable insights, whether you’re building enterprise reasoning engines, interactive agents, or next-gen graph-powered search. Unlock the full speed of your data, from Python, with just a few lines of code!
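
For a sense of the workflow, below is a small hedged sketch using cuDF and cuGraph; the toy edge list and hop count are illustrative assumptions, and a depth-limited BFS stands in for a simple multi-hop traversal over a GPU-resident graph.

```python
import cudf
import cugraph

# Toy knowledge-graph edge list kept entirely in GPU memory (illustrative data).
edges = cudf.DataFrame({
    "src": [0, 0, 1, 2, 3],
    "dst": [1, 2, 3, 3, 4],
})

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Depth-limited BFS from entity 0 acts as a simple multi-hop traversal:
# every vertex reachable within two hops of the start node.
bfs_result = cugraph.bfs(G, start=0, depth_limit=2)
two_hop_neighbors = bfs_result[bfs_result["distance"] > 0]
print(two_hop_neighbors[["vertex", "distance"]])
```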

Kubernetes as the Universal GPU Control Plane for AI Workloads

AI workloads are driving huge demand for GPUs and AI accelerators, yet the default Kubernetes model still leans on vendor-specific device plugins, which tie workloads to particular hardware and complicate portability across heterogeneous clusters. In this session, members from the Kubernetes and KAITO projects will present a more unified alternative: coupling HAMi’s device virtualization and unified scheduling abstraction with KAITO’s AI workload automation, transforming Kubernetes into a cross-vendor GPU control plane. Together, they enable cross-vendor accelerator management, reducing lock-in and improving workload portability.

We’ll walk through demos that show how HAMi abstracts device details (splitting, isolation, topology-aware scheduling), while KAITO automates workload lifecycles (model deployment, node provisioning, scaling). Attendees will leave with a practical blueprint for running AI workloads on heterogeneous infrastructure on Kubernetes.
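
As a rough illustration of where vendor coupling shows up, the hedged sketch below uses the official Kubernetes Python client to request GPU resources on a pod; the image is a placeholder, and the fractional-GPU resource names (nvidia.com/gpumem, nvidia.com/gpucores) are assumptions about HAMi-style extended resources rather than a confirmed API.

```python
from kubernetes import client, config

# Sketch only: image and extended resource names are placeholders / assumptions.
config.load_kube_config()

container = client.V1Container(
    name="llm-server",
    image="example.com/llm-server:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        limits={
            "nvidia.com/gpu": "1",        # plain device-plugin request ties the pod to one vendor
            "nvidia.com/gpumem": "8000",  # assumed HAMi-style fractional GPU memory request (MiB)
            "nvidia.com/gpucores": "50",  # assumed HAMi-style compute-share request (%)
        }
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-server", labels={"app": "llm-server"}),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```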

GPU-Agnostic AI Inference with Ray on Kubernetes

In production AI systems today, 60–70% of inference workloads are tightly coupled to a single GPU vendor or instance type, leading to 30–50% higher infrastructure costs, poor portability, and operational friction when scaling across cloud and on-prem environments. As demand grows, teams face a choice: lock in deeper or redesign for flexibility.

This session presents a GPU-agnostic inference architecture built with Ray on Kubernetes, designed to run reliably across heterogeneous accelerator clusters. By decoupling application logic from hardware assumptions and leveraging Ray’s distributed execution with Kubernetes-native scheduling, teams can scale inference without rewriting pipelines for each GPU type.

Using a production-grade reference architecture, we’ll show how inference traffic flows through Ray Serve, how workloads scale across mixed CPU/GPU nodes, and how concurrency, fault tolerance, and autoscaling are handled under real-world load. We’ll also demonstrate how KubeRay reduces operational overhead by managing Ray clusters through Kubernetes-native lifecycle controls.
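
To ground this, here is a minimal hedged Ray Serve sketch; model loading is elided, and the GPU claim and autoscaling bounds are illustrative assumptions. The deployment declares generic resource needs, and Ray plus the Kubernetes scheduler decide where replicas land, which is what keeps the application code GPU-agnostic.

```python
from ray import serve
from starlette.requests import Request

# Illustrative sketch: GPU fractions and autoscaling bounds are assumptions;
# model loading is elided.
@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # a generic GPU claim, not a specific vendor or SKU
    autoscaling_config={"min_replicas": 1, "max_replicas": 8},
)
class InferenceModel:
    def __init__(self):
        # Load the model here; hardware-specific details stay out of request handling.
        self.model = None

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # In a real deployment, run inference with self.model on payload["inputs"].
        return {"outputs": payload.get("inputs")}

app = InferenceModel.bind()
# serve.run(app)  # with KubeRay, this application is typically submitted via a RayService resource
```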

Scaling AI Inference Across Heterogeneous GPUs with Ray on Kubernetes

In production AI systems today, 60–70% of inference workloads are tightly coupled to a single GPU vendor or instance type, leading to 30–50% higher infrastructure costs, poor portability, and operational friction when scaling across cloud and on-prem environments. As demand grows, teams face a choice: lock in deeper or redesign for flexibility.

This talk presents a GPU-agnostic inference architecture built with Ray on Kubernetes, designed to run reliably across heterogeneous accelerator clusters. By decoupling application logic from hardware assumptions and leveraging Ray’s distributed execution with Kubernetes-native scheduling, teams can scale inference without rewriting pipelines for each GPU type.

Using a production-grade reference architecture, we’ll show how inference traffic flows through Ray Serve, how workloads scale across mixed CPU/GPU nodes, and how concurrency, fault tolerance, and autoscaling are handled under real-world load. We’ll also demonstrate how KubeRay reduces operational overhead by managing Ray clusters through Kubernetes-native lifecycle controls.

Persistent AI Memory with OpenSearch: Building Context-Aware Agents that Learn Over Time

AI agents often lose context between interactions, limiting personalization, continuity, and long-term learning. This session introduces persistent agentic memory with OpenSearch, enabling context-aware agents that remember, learn, and improve over time.

We explain how OpenSearch agentic memory provides a durable, searchable memory layer where agents can store and retrieve session history, working context, long-term knowledge, and audit logs. You’ll learn how configurable memory processing extracts facts, preferences, and summaries from interactions, turning raw conversations into lasting intelligence. We also cover namespace design for isolating memory by user, agent, or session to support secure, scalable systems.

In this talk, we’ll demonstrate how to integrate agentic memory with existing AI frameworks using standard REST APIs, allowing seamless connection to LangChain, LangGraph, or custom agent pipelines without vendor lock-in. Finally, we’ll show how memory retrieval directly influences agent decisions and improves personalization over time.
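
For orientation, here is a hedged sketch of that REST flow using Python’s requests library; the endpoint paths follow the OpenSearch ML Commons memory APIs as we understand them, and the host, credentials, and payload fields are illustrative assumptions to adapt to your cluster and version.

```python
import requests

# Sketch only: host, credentials, and payload fields are placeholders, and the
# endpoint paths should be checked against your OpenSearch version's ML Commons docs.
BASE = "https://localhost:9200"
AUTH = ("admin", "admin")

# 1. Create a memory container scoped to one user/agent session.
memory = requests.post(
    f"{BASE}/_plugins/_ml/memory",
    json={"name": "support-agent:user-42"},
    auth=AUTH, verify=False,
).json()
memory_id = memory["memory_id"]

# 2. Persist one interaction so later turns (or other tools) can retrieve it.
requests.post(
    f"{BASE}/_plugins/_ml/memory/{memory_id}/messages",
    json={
        "input": "My order arrived damaged, please replace it.",
        "response": "Filed a replacement request; delivery in 3-5 days.",
    },
    auth=AUTH, verify=False,
)

# 3. Before the next agent turn, pull the stored history back into the prompt context.
history = requests.get(
    f"{BASE}/_plugins/_ml/memory/{memory_id}/messages",
    auth=AUTH, verify=False,
).json()
```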
