Ankur Gupta

LinkedIn, Senior Staff Technical Program Manager

San Jose, California, United States

Ankur Gupta is a Senior Staff Technical Program Manager in AI infrastructure at LinkedIn, a platform serving over a billion members. He leads large-scale programs spanning capacity planning, GPU efficiency, ML production reliability, and data center build-outs, including a first-of-its-kind hardware forecasting control plane. With 15+ years in AI infrastructure, cloud platforms, and distributed systems at companies including LinkedIn and Walmart, he turns hard technical problems into scalable platforms with measurable business impact. He is an IEEE Senior Member.

Area of Expertise

Business & Management
Information & Communications Technology

Topics

ML/AI Infrastructure
Infrastructure Capacity Planning
Reliability Engineering
Data Platform
Systems Engineering
Software Infrastructure
Technical Program Management
Technical Architecture

Why ML Deployments Fail in Production And How to Engineer Reliability at Scale

Machine learning breakthroughs often receive significant attention, but many organizations struggle with a more fundamental challenge: reliably deploying those models into production systems. Models that perform well in research environments frequently encounter unexpected failures once they interact with real-world infrastructure, data pipelines, and production workloads.

Production ML systems must coordinate across multiple layers, including model training pipelines, data dependencies, deployment infrastructure, and serving systems. A small readiness gap in any one of these layers can lead to failed deployments, unstable rollouts, or degraded system performance.

This session explores common failure modes in machine learning deployments and examines why production ML systems are often harder to operate than traditional software systems. Drawing on real-world platform engineering experiences, the talk highlights practical strategies organizations use to improve deployment reliability, reduce operational friction, and enable teams to move models safely from experimentation to large-scale production environments.

Key Learnings
• Why machine learning deployments frequently fail in production environments
• The hidden system dependencies that complicate ML deployment pipelines
• How reliability challenges differ between traditional software systems and ML systems
• Platform engineering strategies that help improve deployment success rates
• Practical approaches for enabling safer and more reliable model rollouts
