Speaker

Zain Malik

Principal Software Engineer @ Exostellar

Vienna, Austria

Principal Engineer building GPU optimization and Kubernetes infrastructure platforms for AI/ML workloads. I design and build scheduling and orchestration solutions for modern accelerators, from device integration to fleet-scale cluster management.

Previously led infrastructure teams, scaling to 40,000+ pods across 5,000 nodes. I'm passionate about giving back, so you might catch me at KubeCon sharing lessons from the trenches.

Obsessed with reliability and performance at scale. Track record of shipping.

Area of Expertise

  • Information & Communications Technology

Topics

  • Kubernetes
  • GPU
  • GPU-Accelerated Clusters
  • distributed systems
  • performance tuning

Unleashing the Power of Cluster API: Extensibility and Customization

Cluster API, designed with extensibility at its core, has revolutionized Kubernetes cluster management. Its open and pluggable architecture empowers providers to implement custom solutions tailored to their unique requirements. In this session, we will explore how Cluster API's extension-by-design philosophy has opened new horizons for organizations seeking to create bespoke Kubernetes clusters.

Managing Kubernetes clusters at scale presents unique operational challenges that cannot be tamed with manual operations.

Through real-world examples and lessons learned, we will demonstrate how Cluster API's flexibility allows for the integration of diverse infrastructure providers and the implementation of organization-specific customizations. Attendees will gain insights into best practices for extending Cluster API, including developing custom controllers, integrating third-party tools, and creating bespoke workflows.

Fix First, Investigate Later: When an eBPF Rollout Brought Down Our Network

When your production network suddenly starts dropping packets, the last thing you expect is that your cloud provider quietly deployed a new monitoring tool. This talk shares our journey from mysterious outage to desperate fix to surprising discovery.

It started with alerts: packet loss spiking, network throughput crashing from 800 MB/s to around 250 MB/s. No recent changes on our end. Hours into the crisis, we discovered an unfamiliar DaemonSet running eBPF programs - Retina, silently rolled out across our clusters. But here's the catch: we couldn't remove it. Every change we made to the DaemonSet was instantly reconciled back to its original state.
With users impacted and no time for root cause analysis, we took a leap and built a mutating admission webhook to intercept and neuter this mysterious DaemonSet. It worked instantly - networks recovered, crisis averted.

Only then could we investigate: How did an eBPF observability tool cause such devastation? And why didn't we know it was being deployed?
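
As a rough illustration of the webhook fix described above, the sketch below builds the response a mutating admission webhook could return to rewrite a DaemonSet as it passes through the API server. It is a minimal sketch under stated assumptions, not the implementation from the talk: the abstract does not say how the DaemonSet was neutered, so the nodeSelector trick, the target namespace and name, and the example.com/never-schedule label are all hypothetical. Only the AdmissionReview response shape (uid, allowed, patchType, base64-encoded patch) follows the Kubernetes admission.k8s.io/v1 API.

```python
import base64
import json

# Hypothetical target: the abstract does not name the namespace or object.
TARGET_NAMESPACE = "kube-system"
TARGET_NAME = "retina-agent"


def build_admission_response(review: dict) -> dict:
    """Build an AdmissionReview response, patching only the target DaemonSet."""
    request = review["request"]
    # Always allow the request; the webhook only mutates, it never rejects.
    response = {"uid": request["uid"], "allowed": True}

    obj = request.get("object") or {}
    meta = obj.get("metadata", {})
    is_target = (
        request.get("kind", {}).get("kind") == "DaemonSet"
        and meta.get("namespace") == TARGET_NAMESPACE
        and meta.get("name") == TARGET_NAME
    )
    if is_target:
        # Assumed neutering strategy: set a nodeSelector that no node
        # satisfies, so the DaemonSet schedules zero pods.
        patch = [
            {
                "op": "add",
                "path": "/spec/template/spec/nodeSelector",
                "value": {"example.com/never-schedule": "true"},
            }
        ]
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(
            json.dumps(patch).encode("utf-8")
        ).decode("ascii")

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Because the patch is applied at admission time, it would also apply whenever an external controller writes the original spec back, which is why a webhook can hold where a manual edit cannot. A real deployment would additionally need an HTTPS server and a MutatingWebhookConfiguration scoped to DaemonSets.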

But What About Reliability? - The Multi-Million Dollar Kubernetes Cost Optimization Question

“But what about reliability?” We heard this question 865 times while staring at 9% CPU utilization. Each time, it was followed by a VM-era horror story or a revenue shield: “We bring in millions in revenue; we deserve idle resources for peace of mind.”

This session reveals 9 battle-tested Kubernetes-native strategies that took us from 9% to 50% utilization while IMPROVING reliability. The same directors who predicted “catastrophic failure” now champion optimization, panic-paging our team if costs regress.

Discover practical implementations and their pitfalls, including tuning workload limits, packing too many pods onto nodes, API server pressure, and running spot nodes reliably. You can selectively adopt and combine these strategies to build your own multi-dimensional cost optimization blueprint, tailored to the distinct challenges of your platform. Every technique uses open-source CNCF tools, because the most expensive infrastructure isn’t compute - it’s fear.

KubeCon + CloudNativeCon North America 2025 Sessionize Event

November 2025 Atlanta, Georgia, United States

KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 Sessionize Event

August 2024 Hong Kong
