Most Active Speaker

Abdel Sghiouar

Cloud Developer Advocate

Abdel Sghiouar is a senior Cloud Developer Advocate @Google Cloud, a co-host of the Kubernetes Podcast by Google, and a CNCF Ambassador. His focus areas are GKE/Kubernetes, Service Mesh, and Serverless. Abdel started his career in datacenters and infrastructure in Morocco, where he is originally from, before moving to Google's largest EU datacenter in Belgium. He then joined Google Cloud Professional Services in Sweden and spent five years working with Google Cloud customers on architecting and designing large-scale distributed systems before turning to advocacy and community work.

Awards

  • Most Active Speaker 2024

Hands-on with Ray on Kubernetes

The rapidly evolving landscape of Machine Learning and Large Language Models demands efficient, scalable ways to run distributed workloads to train, fine-tune, and serve models. Ray is an Open Source framework that simplifies distributed machine learning, and Kubernetes streamlines deployment.
In this hands-on session we will explore Ray as a framework and how it integrates with Kubernetes to run scalable distributed machine learning workloads. We will cover Ray scalability, patterns for running RayJobs and RayServe, and best practices for creating multi-tenant ML platforms using Ray on Kubernetes with fair sharing of scarce hardware accelerators. Along the way, we'll uncover how to combine Ray and Kubernetes for your ML projects.
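
To make the Ray programming model concrete, here is a minimal sketch of Ray's task API. It assumes a RayCluster created by the KubeRay operator whose head service is reachable at the illustrative address below; the workload is a stand-in for real preprocessing or training.

```python
# A minimal Ray tasks sketch; the cluster address and the workload are illustrative.
import ray

# Connect to an existing Ray cluster (e.g. one created by the KubeRay operator).
ray.init(address="ray://raycluster-kuberay-head-svc:10001")

@ray.remote(num_cpus=1)
def square_sum(shard: list[int]) -> int:
    # Stand-in for real per-shard work such as feature engineering.
    return sum(x * x for x in shard)

# Fan the shards out across the cluster, then gather the partial results.
futures = [square_sum.remote(list(range(i, i + 1_000))) for i in range(0, 10_000, 1_000)]
print(sum(ray.get(futures)))
```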

What in the mess is a Service Mesh?

Service Mesh is becoming a key component in the Cloud Native world. It allows Dev and Ops teams to connect, secure, and observe applications without mixing business logic with infrastructure concerns. This way, teams can focus on delivering value, letting the mesh do all the complex non-functional work like service discovery, load balancing, encryption, authentication, authorization, support for the circuit breaker pattern, and other capabilities. Istio is one of the major Open Source Service Mesh options available today. In this session, you will gain a basic understanding of Istio's concepts.

What’s new in the Kubernetes Gateway API

The Gateway API was introduced to Kubernetes in 2019. The project is making steady progress toward becoming the single expressive API for inbound traffic that is portable, extensible, and role-oriented, with over 20 implementations and multiple objects recently reaching GA. This session explores what's happening in the project: what is the state of the API and the various implementations? We will also cover the GAMMA initiative, which aims to use the Gateway API as a standard way to describe East-West traffic (AKA mesh traffic).
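
As a rough illustration of the API's role-oriented model, the sketch below creates an HTTPRoute with the Kubernetes Python client. It assumes a cluster that already has a Gateway API implementation installed and an existing Gateway named example-gateway; every resource name here is hypothetical.

```python
# A minimal sketch: create an HTTPRoute that attaches to an existing Gateway.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod

# The HTTPRoute expressed as a plain dict; in practice this is usually written as YAML.
http_route = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {"name": "store-route", "namespace": "default"},
    "spec": {
        "parentRefs": [{"name": "example-gateway"}],
        "rules": [
            {
                "matches": [{"path": {"type": "PathPrefix", "value": "/store"}}],
                "backendRefs": [{"name": "store-svc", "port": 8080}],
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="gateway.networking.k8s.io",
    version="v1",
    namespace="default",
    plural="httproutes",
    body=http_route,
)
```

The split shown here is the role-oriented part: the application team owns the HTTPRoute, while the Gateway it attaches to is typically managed by the platform team.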

Working with Gemma and Open LLMs on Google Kubernetes Engine

The Gemma family of open models can be fine-tuned on your own custom dataset to perform a variety of tasks, such as text generation, translation, and summarization. Combined with Kubernetes, you can unlock open source AI innovation with scalability, reliability, and ease of management.

In this workshop, you will learn through a guided hands-on exercise how you can work with Gemma and fine-tune it on a Kubernetes cluster. We will also explore options for serving Gemma on Kubernetes with accelerators and Open Source tools.
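
As a rough preview of the fine-tuning part, here is a minimal LoRA sketch built on the Hugging Face stack. It assumes transformers, peft, datasets, and accelerate are installed, access to the google/gemma-2b weights has been granted, and a GPU is available; the dataset and hyperparameters are illustrative, not the workshop's actual material.

```python
# A minimal LoRA fine-tuning sketch; dataset and hyperparameters are illustrative.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Train small LoRA adapters instead of updating all of the base model's weights.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(row):
    return tokenizer(row["instruction"] + "\n" + row["response"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-lora", per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # labels = input_ids
).train()
```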

Introducing LLM Instance Gateways for Efficient Inference Serving

Large Language Models (LLMs) are revolutionizing applications, but efficiently serving them in production is a challenge. Existing API endpoints, LoadBalancers, and Gateways focus on HTTP/gRPC traffic, which is already a well-defined space. LLM traffic is completely different: a request to an LLM is usually characterized by the size of the prompt, the size and efficiency of the model, and similar factors.

Why are LLM Instance Gateways important? They solve the problem of efficiently managing and serving multiple LLM use cases with varying demands on shared infrastructure.

What will you learn? The core challenges of LLM inference serving: understanding the complexities of deploying and managing LLMs in production, including resource allocation, traffic management, and performance optimization.

We will dive into how LLM Instance Gateways work: how they route requests, manage resources, and ensure fairness among different LLM use cases.
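
To illustrate the kind of decision such a gateway makes, here is a deliberately simplified routing sketch. It is not the actual gateway implementation: the backends, the queue-depth and KV-cache signals, and the thresholds are all hypothetical.

```python
# A toy illustration of LLM-aware routing; all names, metrics and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    model: str
    queue_depth: int            # outstanding requests reported by the serving engine
    free_kv_cache_ratio: float  # fraction of KV-cache memory still available

def pick_backend(prompt: str, model: str, backends: list[Backend]) -> Backend:
    """Route to the least loaded replica serving the requested model,
    penalizing nearly-full replicas when the prompt is long."""
    candidates = [b for b in backends if b.model == model]
    long_prompt = len(prompt.split()) > 1024

    def cost(b: Backend) -> float:
        penalty = 10.0 if long_prompt and b.free_kv_cache_ratio < 0.2 else 0.0
        return b.queue_depth + penalty

    return min(candidates, key=cost)

backends = [
    Backend("vllm-0", "gemma-2b", queue_depth=3, free_kv_cache_ratio=0.6),
    Backend("vllm-1", "gemma-2b", queue_depth=1, free_kv_cache_ratio=0.1),
]
print(pick_backend("Summarize this article ...", "gemma-2b", backends).name)
```

A real gateway would pull such signals from the serving engines' metrics rather than hard-code them, but the routing decision it makes is of this shape.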

Bringing GenAI to the Modern Enterprise. A production use-case. In Serverless Java!

Generative AI adoption starts from business needs, not technological aspects.

Enterprises constantly strive for a competitive edge through technology, and LLM solutions offer unique potential. However, this happens only once we clearly understand that our business requirements transcend our current technical capabilities.

Let's roll up our sleeves and learn hands-on how to build, test and deploy cutting-edge, powerful Gen AI applications in the Modern Enterprise, in a Serverless environment, using Java, AI orchestration frameworks and multiple LLMs, with a concrete, real-world production use-case as a backdrop.
The workshop empowers the enterprise Java developer to unlock new, creative possibilities for their Java apps and build features in novel ways.
It caters to the seasoned Java developer in equal measure as to the curious newcomer to GenAI and is crafted as a follow-along workshop.
What are you going to leave this session with:
- a well-balanced, end-to-end, multi-modal RAG application built in Java, ready to run in the cloud and serve as a reference architecture for a modern generative AI enterprise app
- the same solution implemented in BOTH Spring AI and LangChain4j, today's dominant Java AI orchestration frameworks
- experience deploying Gen AI apps to Cloud Run, a serverless environment
- experience using multiple LLMs deployed in:
  - managed environments - Google Vertex AI
  - local environments - Ollama, Testcontainers
  - Kubernetes - vLLM, an optimized LLM serving engine
- the full codebase, configuration, and deployment instructions

Advanced Ray for distributed ML on Kubernetes

Modern machine learning workloads demand scalable, flexible infrastructure that can handle complex computational requirements. This talk explores how Ray, an open-source unified framework, makes distributed machine learning on Kubernetes easier with its advanced capabilities.

In this talk we will explore Ray's integration with Kubernetes to run scalable distributed machine learning workloads. We will cover Ray scalability, patterns for running RayJobs and RayServe, and best practices for creating multi-tenant ML platforms using Ray on Kubernetes with fair sharing of scarce hardware accelerators.
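
For a flavor of the serving side, here is a minimal Ray Serve sketch. It assumes a running Ray cluster (for example one created by the KubeRay operator) with ray[serve] installed, and uses a trivial echo deployment in place of a real model.

```python
# A minimal Ray Serve sketch; the deployment body is a stand-in for real model inference.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class Echoer:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # A real deployment would run model inference here.
        return {"echo": payload.get("prompt", "")}

# Bind and run the deployment; Serve exposes it over HTTP on the cluster.
serve.run(Echoer.bind(), route_prefix="/generate")
```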

Yes you can run LLMs on Kubernetes

As LLMs become increasingly powerful and ubiquitous, the need to deploy and scale these models in production environments grows. However, the complexity of LLMs can make them challenging to run reliably and efficiently. In this talk, we'll explore how Kubernetes can be leveraged to run LLMs at scale.

We'll cover the key considerations and best practices for packaging LLM inference services as containerized applications using popular OSS inference servers like TGI, vLLM and Ollama, and deploying them on Kubernetes. This includes managing model weights, handling dynamic batching and scaling, implementing advanced traffic routing, and ensuring high availability and fault tolerance.

Additionally, we'll discuss accelerator management and serving models on multiple hosts. By the end of this talk, attendees will have a comprehensive understanding of how to successfully run their LLMs on Kubernetes, unlocking the benefits of scalability, resilience, and DevOps-friendly deployments.
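
As a small client-side sketch, the snippet below queries a vLLM server through its OpenAI-compatible API. It assumes the server is already deployed behind a Kubernetes Service named vllm-svc (a hypothetical name) and was started with the model referenced below.

```python
# A minimal client sketch against a vLLM server's OpenAI-compatible endpoint.
from openai import OpenAI

# The Service name and port are assumptions about how the server was deployed.
client = OpenAI(base_url="http://vllm-svc:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="google/gemma-2b-it",  # must match the model the server was started with
    messages=[{"role": "user", "content": "Explain Kubernetes in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```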

Introduction to Distributed ML Workloads with Ray on Kubernetes

The rapidly evolving landscape of Machine Learning and Large Language Models demands efficient, scalable ways to run distributed workloads to train, fine-tune, and serve models. Ray is an Open Source framework that simplifies distributed machine learning, and Kubernetes streamlines deployment. In this introductory talk, we'll uncover how to combine Ray and Kubernetes for your ML projects. You will learn about:
- Basic Ray concepts (actors, tasks) and their relevance to ML (see the sketch after this list)
- Setting up a simple Ray cluster within Kubernetes
- Running your first distributed ML training job
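
Here is the promised minimal actor sketch. Calling ray.init() with no address starts a local Ray instance, or you can point it at an existing cluster such as one created by the KubeRay operator.

```python
# A minimal Ray actor sketch; the counter stands in for a stateful worker.
import ray

ray.init()  # local instance, or pass address="ray://<head-svc>:10001" for a cluster

@ray.remote
class Counter:
    """A tiny stateful worker; real actors typically hold a model or a data shard."""
    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> int:
        self.value += 1
        return self.value

counter = Counter.remote()  # the actor lives in its own worker process
results = ray.get([counter.increment.remote() for _ in range(5)])
print(results)  # [1, 2, 3, 4, 5]
```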

Improving Developer productivity with API-first tooling

Even AI is good enough to generate YAML now. However, to secure your job, you will need tools that use more complex general-purpose programming languages.
Jokes aside, we'll look at a fleet of tools you can use for common configuration and environment setup tasks without touching any YAML. From creating complex CI pipelines to setting up a local development environment to managing your cloud infrastructure and deployments, many essential parts of cloud-native project setup can be done in your favorite programming languages. You'll learn how open-source tools abstract container configuration and management (Testcontainers), build tool actions (Dagger), and cloud infrastructure setup (Pulumi).

This session will teach you how to use these tools and give you enough time to start building customized experiences for your application development teams.
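
As a taste of the API-first approach, here is a minimal sketch using testcontainers-python (the session itself demonstrates the Java flavor of Testcontainers alongside Dagger and Pulumi). It assumes Docker is available locally and that the testcontainers, sqlalchemy, and psycopg2 packages are installed.

```python
# A minimal testcontainers-python sketch: the container lifecycle lives in code,
# not in a docker-compose file. Image tag and query are illustrative.
import sqlalchemy
from testcontainers.postgres import PostgresContainer

with PostgresContainer("postgres:16") as postgres:
    # The container is started for the duration of the block and cleaned up afterwards.
    engine = sqlalchemy.create_engine(postgres.get_connection_url())
    with engine.connect() as connection:
        print(connection.execute(sqlalchemy.text("select version()")).scalar())
```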

End to End ML with Kubernetes, Ray and Java

The rapidly evolving landscape of Machine Learning (ML) and Large Language Models (LLMs) demands efficient, scalable ways to run distributed workloads for training, fine-tuning, serving, and consuming models. Java, the dominant language in enterprise environments, faces pressure to not only modernize application stacks but also to embrace AI, driven by business needs and the myriad possibilities AI offers.
In this context, LangChain4j has emerged as the leading framework for building GenAI applications on the JVM. However, the challenge extends beyond simply calling an LLM from a Java application: how does one build an end-to-end platform from data to a working application? This is where Ray and Kubernetes come into play. Ray, an open-source framework, simplifies distributed machine learning, while Kubernetes streamlines deployment.
This deep-dive session will explore how to combine Java, LangChain4j, Ray, and Kubernetes for ML applications.

Distributed Fine Tuning of Open LLMs on Kubernetes

Open LLMs are a family of ML models that can be fine-tuned on your own custom dataset to perform a variety of tasks, such as text generation, translation, and summarization. Combined with Kubernetes, you can unlock open source AI innovation with scalability, reliability, and ease of management.
In this session, we will deep dive into how you can fine-tune Open LLMs on a Kubernetes cluster. We will also explore options for serving LLMs on Kubernetes with accelerators and Open Source tools.
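
As a rough sketch of what "distributed" means here, the skeleton below uses PyTorch DistributedDataParallel. It assumes it is launched with torchrun (for example from a Kubernetes Job or a Kubeflow PyTorchJob), and it substitutes a tiny model and random data for a real open LLM and dataset.

```python
# A minimal distributed-training skeleton; launch with: torchrun --nproc_per_node=N train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

use_cuda = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_cuda else "gloo")
# Simplified single-node mapping of rank to device; multi-node setups use LOCAL_RANK.
local_rank = dist.get_rank() % (torch.cuda.device_count() if use_cuda else 1)
device = torch.device(f"cuda:{local_rank}") if use_cuda else torch.device("cpu")

# A tiny model stands in for a real open LLM; DDP keeps gradients in sync across workers.
model = DDP(torch.nn.Linear(128, 128).to(device),
            device_ids=[local_rank] if use_cuda else None)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    batch = torch.randn(8, 128, device=device)
    loss = model(batch).pow(2).mean()  # placeholder loss on random data
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if dist.get_rank() == 0:
        print(f"step {step} loss {loss.item():.4f}")

dist.destroy_process_group()
```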

Training and Serving LLMs on Kubernetes: A beginner’s guide.

Large Language Models (LLMs) are revolutionizing natural language processing, but their size and complexity can make them challenging to deploy and manage. This talk will provide a beginner-friendly introduction to using Kubernetes for training and serving LLMs.
We'll cover:
- The basics of Kubernetes: a quick overview of core Kubernetes concepts (pods, containers, deployments, services) essential for understanding LLM deployment.
- LLMs and resource demands: the unique computational resource requirements of LLMs and how Kubernetes helps manage them effectively.
- Training LLMs on Kubernetes: practical guidance on setting up training pipelines, addressing data distribution, and model optimization within a Kubernetes environment.
- Serving LLMs for inference: strategies for deploying LLMs as services, load balancing, and scaling to handle real-world traffic (see the sketch below).
If you're interested in harnessing the power of LLMs for your projects, this talk will provide a solid foundation for utilizing Kubernetes to streamline your workflow.
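
Here is a minimal sketch of the "serving as a Kubernetes Deployment" idea, using the Kubernetes Python client. The image, model, and resource names are illustrative, and in practice the same objects are usually written as YAML manifests.

```python
# A minimal sketch: expose an inference server as a Deployment plus a Service.
# Names, image and model are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()
labels = {"app": "llm-server"}

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="llm-server"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="vllm",
                image="vllm/vllm-openai:latest",
                args=["--model", "google/gemma-2b-it"],
                ports=[client.V1ContainerPort(container_port=8000)],
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )]),
        ),
    ),
)
service = client.V1Service(
    metadata=client.V1ObjectMeta(name="llm-server"),
    spec=client.V1ServiceSpec(selector=labels,
                              ports=[client.V1ServicePort(port=80, target_port=8000)]),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
client.CoreV1Api().create_namespaced_service(namespace="default", body=service)
```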

LLM Observability with OpenTelemetry on Kubernetes

Large Language Models (LLMs) have gained significant prominence due to their diverse applications, ranging from conversational agents to code generation assistants. Given their increasing deployment in production environments, understanding and monitoring LLM behavior has become crucial for effective implementation and risk management.

Observability for LLMs goes beyond monitoring which prompts are sent to the model and which responses are received; it also includes monitoring the application making the call in a distributed system, and accounting for the wide range of options for using a Large Language Model, from cloud-hosted versions to local open models. Kubernetes is a common platform for deploying both the apps and the LLMs.

In this session we will explore how OpenTelemetry, the de facto Open Source tool for logging, monitoring, and tracing, can be used on top of Kubernetes to keep an eye on application and LLM behavior. We will explore tracing calls and monitoring prompts, parameters, and costs.
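
As a minimal tracing sketch, the snippet below wraps an LLM call in an OpenTelemetry span. It assumes the opentelemetry-sdk and OTLP exporter packages are installed and that an OpenTelemetry Collector is reachable at the illustrative endpoint below (for example one deployed by the OpenTelemetry Operator); the LLM call itself is a stand-in.

```python
# A minimal OpenTelemetry tracing sketch; endpoint, service and attribute names are illustrative.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "chat-frontend"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def call_llm(prompt: str) -> str:
    # Record the attributes you care about for LLM observability: model, token counts, etc.
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.model", "gemma-2b-it")
        span.set_attribute("llm.prompt_tokens", len(prompt.split()))
        response = "stub response"  # replace with the actual model or API call
        span.set_attribute("llm.completion_tokens", len(response.split()))
        return response

print(call_llm("What is a service mesh?"))
```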
