Nishant Gupta

Tech Lead, Software Engineering @ Meta SuperIntelligence Lab (MSL) • AI Infrastructure • Distributed Systems • Researcher • Speaker • Startup Advisor

San Francisco, California, United States

# Introduction
I am a Staff Software Engineer and Researcher at Meta, specializing in large-scale distributed systems and applied AI. I am passionate about building reliable, scalable, and intelligent infrastructure that powers the next generation of agentic workflows. With deep expertise spanning agentic infrastructure, systems architecture, and operational resilience, I focus on solving the hardest problems at the intersection of systems, AI, and real-world execution, where theory meets engineering tradeoffs.

Within Meta SuperIntelligence Lab (MSL), I have contributed to building agentic infrastructure - systems where AI agents operate within structured distributed environments, interacting with monitoring, scheduling, and feedback loops. My work in this space focuses on:
- Evaluation and auditing of AI-driven decision systems in high-stakes production environments
- Reliability, safety, and human oversight in autonomous and semi-autonomous systems
- Designing feedback mechanisms to align system behavior with user and operational goals
- Measuring real-world impact beyond offline metrics

# Building Elastic Compute Infrastructure at Meta
I also built the next generation of elastic compute infrastructure to increase overall fleet utilization. It manages ~30% of Meta’s capacity (tens of millions of servers) across ~20 geo-distributed datacenters and has saved billions of dollars in CapEx. This work also involved partnering with VPs across Ads, WhatsApp, IG, Finance, and Infra to set the multi-year roadmap and strategy for increasing fleet-wide efficiency.

# Research
At Meta, my recent research includes Dynamic Idle Resource Leasing to Safely Oversubscribe Capacity at Scale, where I designed and deployed a production system that improves datacenter utilization by leasing idle capacity while preserving reliability and strict SLO guarantees. This work required building rigorous evaluation frameworks spanning simulation, controlled experimentation, and real-world safety validation - balancing algorithmic optimization with operational risk. The system has delivered measurable infrastructure-efficiency gains at production scale. I have also authored papers with 90+ citations.

# What I care about
I think deeply about how distributed services communicate, self-coordinate, and act reliably under ambiguity. My work is rooted in understanding latency, correctness, failure modes, and semantic interoperability - not just performance on benchmarks, but real-world outcomes that matter in production. I’ve led teams and initiatives that:
- Architect complex distributed platforms that serve high-availability workloads at scale
- Design agentic systems and frameworks that enable coordinated autonomous behavior across services and models
- Build operationally robust infrastructure with strong observability, fault tolerance, and graceful degradation
- Translate cutting-edge research into developer-ready systems and patterns

# Education
I graduated from the University of California, Los Angeles (UCLA) with a Master's in Computer Science in December 2019. At UCLA, my focus was building scalable distributed systems that leverage machine learning.

# Ways to collaborate
- Keynotes, conference talks, and technical workshops
- Partnerships with AI platforms, developer tools, and education organizations
- Advisory and consulting on AI infrastructure and large-scale systems

For speaking, partnerships, or advisory inquiries: nishantgupta@g.ucla.edu

Area of Expertise

  • Finance & Banking
  • Health & Medical
  • Information & Communications Technology
  • Law & Regulation
  • Media & Information

Topics

  • Agentic Infra
  • Artificial Intelligence
  • Machine Learning
  • Compute Infrastructure
  • Distributed Systems
  • Platform Engineering
  • Capacity Planning
  • Cloud Infrastructure
  • AI Infrastructure
  • Performance Engineering
  • Observability
  • Reliability
  • Data Observability
  • LLM Observability
  • Agentic AI Architecture
  • ML
  • Leadership
  • Career Growth
  • Engineering Leadership
  • AI at Scale
  • ML Infrastructure
  • Mentorship in Engineering
  • Building at Hyperscale
  • Streaming
  • Event Streaming
  • Streaming Data Analytics
  • Value Stream Management
  • Reinforced Reasoning Systems

PRO Session: Agentic Infrastructure: Building Reliable Systems for Autonomous AI Workflows

AI systems are rapidly evolving from passive models into autonomous agents capable of executing complex workflows across APIs, services, and cloud infrastructure. But deploying these systems in production introduces a new class of distributed systems challenges.

In this talk, we explore agentic infrastructure - the systems required to safely run AI agents at scale. Drawing from real-world experience building large-scale infrastructure at Meta, we will examine how agent runtimes interact with APIs, tools, and microservices, and why traditional reliability patterns break down.

We’ll cover:

1. Failure modes unique to AI agents (hallucinated actions, cascading retries)
2. Guardrail architectures and action validation pipelines
3. Observability for non-deterministic systems
4. Cost and safety control mechanisms

Attendees will leave with a blueprint for building production-grade AI systems that are reliable, observable, and safe.

Operating Distributed Inference Systems at Scale

Inference has rapidly become one of the most important infrastructure problems in modern computing. As AI systems evolve into autonomous agents with persistent memory, tool usage, and multi-step reasoning, traditional inference architectures struggle under growing demands for latency, throughput, cost efficiency, and reliability.

In this talk, I’ll share lessons from building large-scale elastic compute and AI infrastructure systems powering production workloads. We’ll explore the modern inference stack and the architectural patterns emerging to support next-generation agentic AI systems.

Topics include:

- Distributed inference architectures for large-scale AI systems
- GPU scheduling and elastic compute for inference workloads
- Multi-tenant inference infrastructure
- Caching, batching, and latency optimization strategies
- Reliability and fault isolation for inference systems
- Observability and control loops for AI serving platforms
- Balancing cost, throughput, and user experience
- Why inference is becoming an infrastructure orchestration problem

Attendees will gain practical insights into designing scalable, resilient, and cost-efficient inference platforms for modern AI workloads.

Agentic Infrastructure in the Cloud: Running Autonomous AI Systems Safely at Scale

AI systems are rapidly evolving from passive models into autonomous agents capable of planning and executing workflows across cloud services and APIs. While advances in large language models enable powerful agent capabilities, deploying these systems in real production environments introduces new engineering challenges around reliability, safety, observability, and governance.

This talk explores agentic infrastructure: the distributed systems required to safely run AI agents within cloud environments. We will examine how agent runtimes interact with tools, APIs, and microservices, and why traditional cloud reliability patterns are often insufficient for autonomous decision systems.

The session presents practical design patterns for building production-grade agent platforms, including action validation pipelines, guardrails, evaluation loops, and human-in-the-loop oversight. We will also discuss strategies for preventing cascading failures, controlling cost and behavior, and making agent decisions observable and auditable.

Attendees will leave with a systems-level blueprint for building reliable, safe, and observable AI agents in modern cloud infrastructure.
