Nikunj Goyal

Member of Technical Staff

Hi, I am Nikunj Goyal, a Machine Learning and Systems Engineer at Adobe, building scalable products at the intersection of Generative AI and vector graphics. My work spans full-stack engineering, model training, GPU optimization, and vector graphics, bringing creative tools to life with deep learning.

I have spoken at KubeCon North America, Europe, and India on topics such as GPU scheduling, intelligent orchestration, and AI in cloud-native environments. I love exploring the cloud-native ecosystem and being part of the vibrant CNCF community.

No More GPU Cold Starts: Making Serverless ML Inference Truly Real-Time

Serverless ML inference is great, but when GPUs are involved, cold starts can turn milliseconds into minutes. Whether you are scaling transformer models or running custom inference services, the startup latency caused by container initialization, GPU driver loading, and heavyweight model deserialization can kill real-time performance and cost you serious money.

In this talk, we'll break down the anatomy of GPU cold starts in modern ML serving stacks, including why GPUs introduce unique cold-path delays, how the CRI and device plugins contribute to them, and what really happens when a PyTorch model boots up on a fresh pod.

We’ll walk through production-ready strategies to reduce startup latency:
- Pre-warmed GPU pod pools to bypass init time
- Model snapshotting with TorchScript or ONNX to speed up deserialization
- Lazy loading techniques that delay model initialization until the first request (see the sketch below)
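
To make the last two strategies concrete, here is a minimal sketch of lazy, thread-safe loading of a TorchScript snapshot, so a fresh pod can accept traffic before paying the deserialization cost. The `model.ts` path and the `handle` entry point are illustrative assumptions:

```python
import threading

import torch

_model = None
_lock = threading.Lock()

def get_model():
    """Load the model on the first request instead of at pod startup.

    torch.jit.load on a TorchScript snapshot skips Python-side model
    construction, which is usually much faster than unpickling a full
    nn.Module checkpoint.
    """
    global _model
    if _model is None:
        with _lock:  # avoid duplicate loads on concurrent first requests
            if _model is None:
                m = torch.jit.load("model.ts", map_location="cuda")  # hypothetical path
                _model = m.eval()
    return _model

def handle(batch: torch.Tensor) -> torch.Tensor:
    # The first call pays the load cost; every later call is warm.
    with torch.inference_mode():
        return get_model()(batch.to("cuda"))
```

The trade-off is that the first request still pays the load cost, which is why lazy loading pairs well with a small pre-warmed pool that absorbs initial traffic.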

Together, these strategies help you eliminate cold-start pain and keep your services fast, efficient, and production-ready.

Unlocking the Future of GPU Scheduling in Kubernetes with Reinforcement Learning

Scaling multi-GPU setups with Kubernetes for large-scale ML projects is a hot topic in both the AI and cloud communities. While Kubernetes can provide computing power by scheduling GPU nodes, issues like resource fragmentation and low utilization plague performance and drive up costs.

Why Reinforcement Learning (RL) in particular, one might ask? Unlike other algorithms, RL shines in its unique ability to continuously adapt to changing environments and to efficiently handle complex, multi-dimensional objectives, making it particularly suitable for the dynamic and heterogeneous nature of Kubernetes clusters.

In this talk, we will explore the current landscape of GPU scheduling and some state-of-the-art RL algorithms proposed for it, diving deep into their current impact on Kubernetes and the possible use of RLHF. We hope the audience will gain insight into these new ways of scheduling GPUs on Kubernetes.
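
To give a flavor of what RL-driven placement looks like, here is a toy Q-learning sketch. The node counts, job sizes, and packing reward are illustrative assumptions, not a production scheduler: the agent simply learns to place GPU jobs so that nodes stay tightly packed.

```python
import random
from collections import defaultdict

NODES, CAPACITY = 3, 8  # toy cluster: 3 nodes with 8 GPUs each

def place(free, job, node):
    """Apply a placement; reward tight packing, penalize infeasible picks."""
    if free[node] < job:
        return free, -1.0                          # node too fragmented for this job
    new = list(free)
    new[node] -= job
    return tuple(new), 1.0 - new[node] / CAPACITY  # fuller node, higher reward

Q = defaultdict(float)             # Q[(state, action)] -> learned value
alpha, gamma, eps = 0.1, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(5000):
    free = tuple([CAPACITY] * NODES)
    jobs = [random.choice([1, 2, 4]) for _ in range(10)]  # per-job GPU demand
    for i, job in enumerate(jobs):
        state = (free, job)
        node = (random.randrange(NODES) if random.random() < eps
                else max(range(NODES), key=lambda a: Q[(state, a)]))
        nxt, reward = place(free, job, node)
        future = (max(Q[((nxt, jobs[i + 1]), a)] for a in range(NODES))
                  if i + 1 < len(jobs) else 0.0)
        Q[(state, node)] += alpha * (reward + gamma * future - Q[(state, node)])
        free = nxt
```

Real systems replace the lookup table with a learned policy and a far richer state (topology, priorities, deadlines), but this adapt-from-feedback loop is exactly what makes RL attractive here.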

Tiny Models, Big Ideas: Quantization for Smarter Inference

With the rise of on-device intelligence, the push to run LLMs on edge hardware — phones, Raspberry Pis, even microcontrollers — is accelerating. At the heart of this revolution is quantization: the art of shrinking models without shrinking their intelligence.

This talk breaks down quantization, walking through how it has evolved from basic tricks to the smart, low-bit methods powering today's compact LLMs. We'll trace how post-training quantization (PTQ) and quantization-aware training (QAT) evolved into smarter methods like GPTQ, AWQ, and SmoothQuant, each balancing performance, accuracy, and deployability.
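
As a baseline for what those methods improve on, here is a minimal sketch of symmetric per-tensor int8 post-training quantization. GPTQ and AWQ refine this idea by minimizing layer output error rather than raw weight error:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor post-training quantization to int8.

    scale maps the weight range onto [-127, 127]; dequantizing as
    w_q * scale recovers an approximation of w.
    """
    scale = w.abs().max() / 127.0
    w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return w_q, scale

w = torch.randn(4096, 4096)   # stand-in for one weight matrix
w_q, scale = quantize_int8(w)
w_hat = w_q.float() * scale   # dequantize to measure the error
print(f"max abs error: {(w - w_hat).abs().max().item():.4f}")
print(f"fp32 {w.numel() * 4 / 2**20:.0f} MiB -> int8 {w.numel() / 2**20:.0f} MiB")
```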

We’ll also dig into the growing toolbox of frameworks that are making it easier than ever to get these models running fast on real hardware — including vLLM, TensorRT-LLM, GGML, and MLC-LLM.
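
For example, serving an already-quantized checkpoint with vLLM takes only a few lines. A minimal sketch, assuming an AWQ checkpoint from the Hugging Face Hub (the model id here is an illustrative choice):

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized model; vLLM handles the low-bit kernels internally.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

params = SamplingParams(max_tokens=64, temperature=0.2)
outputs = llm.generate(["Explain GPU cold starts in one sentence."], params)
print(outputs[0].outputs[0].text)
```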

To wrap it up, we’ll look at real-world examples of quantized LLMs running on edge devices — and see what actually works, what breaks, and how far you can push performance without blowing up memory or latency. If you’re curious about how much model you can fit into a few megabytes — and still get useful completions — this talk is for you.

Scaling ML Smarter: Optimizing Kueue & Volcano with Adaptive Scheduling

Kueue and Volcano are leading the charge in orchestrating large-scale distributed ML jobs. But are they truly maximizing your GPU resources? Traditional batch scheduling methods often suffer from inefficient queue management and rigid allocations that fail to adapt to real-time demand, resulting in problems that grow with your workloads.

This talk dives into how priority-aware queueing and elastic resource allocation can supercharge Kueue and Volcano, making batch scheduling more adaptive and efficient. We’ll break down the scheduler’s architecture, exploring how jobs dynamically move between priority queues, how elastic scheduling adjusts resource allocations in real time, and how these improvements lead to faster job execution and better GPU utilization.
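
To ground the idea, here is a toy sketch of that two-phase policy. The Job fields, quotas, and admission rules are illustrative assumptions rather than Kueue's or Volcano's actual APIs: admit jobs by priority up to each one's guaranteed minimum, then elastically grow admitted jobs toward their maximum while spare GPUs remain.

```python
import heapq

class Job:
    def __init__(self, name, priority, min_gpus, max_gpus):
        self.name, self.priority = name, priority
        self.min_gpus, self.max_gpus = min_gpus, max_gpus
        self.allocated = 0

def admit(jobs, total_gpus):
    """Phase 1: admit by priority at minimum quota. Phase 2: grow elastically."""
    heap = [(-j.priority, i, j) for i, j in enumerate(jobs)]  # max-heap by priority
    heapq.heapify(heap)
    running, free = [], total_gpus
    while heap:
        _, _, job = heapq.heappop(heap)
        if job.min_gpus <= free:       # guaranteed share first
            job.allocated = job.min_gpus
            free -= job.min_gpus
            running.append(job)
    for job in running:                # elastic phase: hand out leftovers
        grow = min(job.max_gpus - job.allocated, free)
        job.allocated += grow
        free -= grow
    return running

jobs = [Job("train-a", 10, 4, 8), Job("tune-b", 5, 2, 4), Job("infer-c", 8, 1, 2)]
for j in admit(jobs, total_gpus=8):
    print(f"{j.name} -> {j.allocated} GPUs")
```

A real scheduler would also shrink running jobs back toward their minimum when higher-priority work arrives, which is what lets allocations track demand in real time.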

Whether you're managing distributed training, hyperparameter tuning, or large-scale inference pipelines, this talk will provide the tools and strategies needed to unlock smarter scheduling and maximize ROI on Kubernetes GPU workloads.

Bridging Infrastructure & Applications - OpenFeature for Dynamic Feature Management in K8s Deployments

As cloud-native systems scale, managing application features dynamically without disrupting services is a cornerstone of modern software delivery. In this talk, we'll delve into the integration of Kubernetes with OpenFeature, a powerful open standard for feature flag management, to bridge application logic with infrastructure orchestration.

The session will explore Kubernetes-native resources like ConfigMaps, Secrets, and external flag services to store and manage feature configs, enabling real-time toggling without service restarts. Combined with Kubernetes' rolling updates, this approach ensures rapid recovery from critical issues, safeguarding system stability. The integration redefines how developers roll out features, offering fine-grained control for A/B testing and canary deployments.
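
As a minimal sketch of the ConfigMap-backed pattern (hand-rolled here rather than using an OpenFeature provider; the mount path and flag name are illustrative assumptions), a service can re-read flags from a mounted volume and pick up flips without a restart:

```python
import json
import os

FLAGS_PATH = os.environ.get("FLAGS_PATH", "/etc/flags/flags.json")  # ConfigMap mount

class FileFlagProvider:
    """Re-reads the mounted flag file whenever it changes on disk.

    Kubernetes updates mounted ConfigMap volumes in place (after a short
    propagation delay), so no pod restart is needed for a flag flip.
    """
    def __init__(self, path=FLAGS_PATH):
        self.path, self._mtime, self._flags = path, 0.0, {}

    def get_bool(self, key, default=False):
        try:
            mtime = os.path.getmtime(self.path)
            if mtime != self._mtime:            # reload only on change
                with open(self.path) as f:
                    self._flags = json.load(f)
                self._mtime = mtime
        except OSError:
            return default                      # file missing: fall back safely
        return bool(self._flags.get(key, default))

flags = FileFlagProvider()
if flags.get_bool("new-checkout-flow"):
    print("serving the new checkout flow")
```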

By combining K8s' orchestration power and OpenFeature’s runtime control, this approach not only redefines feature management but also aligns with the future of scalable, adaptive cloud-native ecosystems.

Unlocking the Power of GPUs in Kubernetes: The Full Spectrum of Scheduling & Resource Sharing

As the demand for GPU-accelerated workloads grows, Kubernetes has become essential for managing containerized applications that rely on GPUs. However, scheduling GPU resources introduces unique challenges, such as inefficient resource utilization, vendor-specific differences, and the need for fine-grained resource management.

Starting with how GPUs have become an integral part of modern clusters, we will dive deep into the key components of GPU scheduling in Kubernetes: the architecture and scheduling policy of the native kube-scheduler and some of the prominent challenges it faces today. The talk further explores how MIG enables partitioning a single GPU into multiple instances, how DRA enhances GPU management in Kubernetes, and the concept of fractional GPUs, showing how together they transform the whole process of managing and scheduling GPUs in K8s clusters.
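
For instance, once the NVIDIA device plugin exposes MIG profiles as extended resources, a pod can request a slice like any other resource. A sketch using the official Kubernetes Python client; the MIG profile name and image are assumptions that depend on how the node is configured:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mig-demo"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="cuda",
            image="nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",
            command=["nvidia-smi", "-L"],  # list the visible MIG device
            resources=client.V1ResourceRequirements(
                # One 1g.5gb MIG instance, scheduled like any extended resource.
                limits={"nvidia.com/mig-1g.5gb": "1"}
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```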

Attendees will gain insights into how these approaches work in Kubernetes, along with best practices for managing GPU resources in large-scale environments.

CNCF-hosted Co-located Events Europe 2025

April 2025, London, United Kingdom

KubeCon + CloudNativeCon North America 2024

November 2024, Salt Lake City, Utah, United States
