Ankur Gupta
LinkedIn, Senior Staff Technical Program Manager
San Jose, California, United States
Ankur Gupta is a Senior Staff Technical Program Manager in AI infrastructure at LinkedIn, a platform serving over a billion members. He leads large-scale programs spanning capacity planning, GPU efficiency, ML production reliability, and data center build-outs, including a first-of-its-kind hardware forecasting control plane. With 15+ years of experience in AI infrastructure, cloud platforms, and distributed systems at companies including LinkedIn and Walmart, he turns hard technical problems into scalable platforms with measurable business impact. He is an IEEE Senior Member.
Why ML Deployments Fail in Production and How to Engineer Reliability at Scale
Machine learning breakthroughs often receive significant attention, but many organizations struggle with a more fundamental challenge: reliably deploying those models into production systems. Models that perform well in research environments frequently encounter unexpected failures once they interact with real-world infrastructure, data pipelines, and production workloads.
Production ML systems must coordinate across multiple layers, including model training pipelines, data dependencies, deployment infrastructure, and serving systems. A small readiness gap in any one of these layers can lead to failed deployments, unstable rollouts, or degraded system performance.
This session explores common failure modes in machine learning deployments and examines why production ML systems are often harder to operate than traditional software systems. Drawing on real-world platform engineering experiences, the talk highlights practical strategies organizations use to improve deployment reliability, reduce operational friction, and enable teams to move models safely from experimentation to large-scale production environments.
Key Learnings
• Why machine learning deployments frequently fail in production environments
• The hidden system dependencies that complicate ML deployment pipelines
• How reliability challenges differ between traditional software systems and ML systems
• Platform engineering strategies that help improve deployment success rates
• Practical approaches for enabling safer and more reliable model rollouts