Kay Yan
Maintainer of Kubespray, containerd/nerdctl, LWS (LeaderWorkerSet), and LLM-D; Software Engineer at DaoCloud
Shanghai, China
Principal Software Engineer at DaoCloud, specializing in Kubernetes-native infrastructure for AI and LLM workloads.
He has 15+ years of hands-on experience in Kubernetes, CNCF ecosystems, and large-scale cloud-native infrastructure. He is an active maintainer and contributor across projects including Kubespray, LeaderWorkerSet, containerd/nerdctl, LLM-D, Kubean, Spiderpool, HwameiStor, and Merbridge. His recent work focuses on distributed inference, AI workload orchestration, and production-grade LLM serving on Kubernetes. He holds 23 Chinese patents and 9 U.S. patents in cloud computing and AI infrastructure. Before DaoCloud, he was a Senior Technologist at DELL/EMC and received the EMC Innovation CTO Award for his contributions to PaaS architecture.
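[Lightning Talk] Kubernetes Operations with Agentic AI: Boosting Efficiency with Kubectl-AI
The Kubernetes ecosystem has produced many excellent AI tools, and kubectl-AI is a typical example. Incubated by Google Cloud Platform, kubectl-AI is an intelligent interface that turns a user's natural-language intent into precise Kubernetes commands, significantly lowering the barrier to cluster management and improving operational efficiency.
This talk covers:
- Core Features: Use kubectl-AI for cluster error analysis and contextualized insights. It supports multiple AI model backends (such as OpenAI's GPT series, Google's Gemini series, Azure OpenAI, and local models run through Ollama), converts natural-language instructions into kubectl commands and executes them, and explains the results.
- Demo: These tools can greatly simplify Kubernetes troubleshooting. For common problems such as a Pod stuck in Pending or CrashLoopBackOff, they can quickly diagnose the issue and suggest fixes. kubectl-AI also handles natural-language requests such as "check the logs in a namespace" or "scale the replicas of a Deployment".
- Deep Dive: kubectl-AI's architecture is modular, typically comprising an input parsing layer (natural language processing, context management, and intent recognition), a model adaptation layer (multi-model routing and prompt engineering), and a Kubernetes integration layer (command validation, execution, and result interpretation).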
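The three layers described in the Deep Dive item can be illustrated with a small sketch. This is a hypothetical Python illustration of the layering idea only, not kubectl-AI's actual code or API: every name here (`parse_input`, `adapt_model`, `validate_command`, `fake_backend`, `ALLOWED_VERBS`) is an assumption made for the example, the model call is replaced by a canned stub, and nothing is executed against a cluster.

```python
import shlex
from typing import Callable

# Hypothetical sketch: these names are illustrative, not kubectl-AI's API.
ModelBackend = Callable[[str], str]


def fake_backend(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned kubectl command."""
    if "scale" in prompt:
        return "kubectl scale deployment web --replicas=3"
    return "kubectl get pods -n default"


# Registry used for multi-model routing; a real tool would register
# remote or local model clients here instead of a stub.
BACKENDS: dict[str, ModelBackend] = {"fake": fake_backend}

# Verbs this sketch is willing to pass through (an assumed allowlist).
ALLOWED_VERBS = {"get", "describe", "logs", "scale"}


def parse_input(request: str, namespace: str) -> str:
    """Input parsing layer: normalize the request and attach context."""
    return f"namespace={namespace}; request={request.strip().lower()}"


def adapt_model(prompt: str, backend: str = "fake") -> str:
    """Model adaptation layer: route the prompt to the chosen backend."""
    return BACKENDS[backend](prompt)


def validate_command(command: str) -> list[str]:
    """Kubernetes integration layer: validate before anything is executed."""
    argv = shlex.split(command)
    if len(argv) < 2 or argv[0] != "kubectl" or argv[1] not in ALLOWED_VERBS:
        raise ValueError(f"refusing to run: {command!r}")
    return argv


def handle(request: str, namespace: str = "default") -> list[str]:
    """Run a request through all three layers and return the validated argv."""
    prompt = parse_input(request, namespace)
    return validate_command(adapt_model(prompt))
```

Calling `handle("Scale the web Deployment to 3 replicas")` returns the validated argument list rather than running it, which keeps the validation step separate from execution.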
Building Custom GPU Clusters at Scale: Using Kubespray to Create High-Performance AI Infrastructure
Kubespray, recognized by Kubernetes' SIG Cluster Lifecycle, deploys production-ready Kubernetes clusters on bare metal, enhancing performance for AI applications with robust GPU support. This session covers Kubespray's fundamentals, key features, and updates.
As AI workloads like LLMs grow, scalable GPU clusters are essential. Engineers will share insights from deploying custom GPU clusters at scale with Kubespray, discussing challenges and best practices. Attendees will learn to integrate Kubernetes technologies like LWS, Kueue, Gateway API Inference Extension, DRA, and tensor parallelism to enhance AI workloads like RAG and LoRA, improving resource utilization and performance.
We'll share Kubespray's inventory source code to customize AI clusters and use Kubernetes operators to define infrastructure in private clouds, enabling efficient cluster scaling.
AI-Powered Kubernetes Diagnostics with K8sGPT
In this Lightning Talk, we’ll dive into K8sGPT, a CNCF sandbox project that uses AI to enhance Kubernetes management. K8sGPT leverages LLMs to diagnose cluster issues, offering root cause analysis and solutions in simple terms. It encodes SRE expertise into analyzers, extracting key insights and enriching them with AI-powered explanations.
Key highlights:
- Core Features: Learn to use the CLI and K8sGPT Operator for cluster error analysis and contextualized insights.
- AI Integration & Security: Explore integration with AI models like OpenAI, Azure, and Ollama, with data anonymization for security.
- Real-world Demos: See how K8sGPT simplifies Kubernetes troubleshooting.
- Enterprise Strategies: Discover techniques like LoRA and RAG to tailor K8sGPT for specific environments.
Whether you're new to Kubernetes or an expert, K8sGPT can streamline cluster management, reduce troubleshooting time, and boost efficiency.
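The data anonymization mentioned under AI Integration & Security can be pictured as masking cluster-specific names before text leaves the cluster and restoring them in the response. The sketch below is a minimal illustration of that idea under stated assumptions, not K8sGPT's implementation: the function names and the `<MASK_n>` placeholder format are hypothetical.

```python
def anonymize(text: str, sensitive: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each sensitive name with a placeholder.

    Returns the masked text plus a placeholder-to-name map for later restoring.
    Longer names are replaced first so a name that contains another name
    is not partially masked.
    """
    mapping: dict[str, str] = {}
    for i, name in enumerate(sorted(sensitive, key=len, reverse=True)):
        token = f"<MASK_{i}>"  # hypothetical placeholder format
        text = text.replace(name, token)
        mapping[token] = name
    return text, mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore the original names in text that came back from the model."""
    for token, name in mapping.items():
        text = text.replace(token, name)
    return text


message = "Pod payments-api in namespace prod-eu is in CrashLoopBackOff"
masked, mapping = anonymize(message, ["payments-api", "prod-eu"])
restored = deanonymize(masked, mapping)
```

Only `masked` would be sent to an external model in this sketch; the mapping stays local so the explanation can be shown with real names.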
Kubespray Unleashed: Navigating Bare Metal Services in Kubernetes for LLM and RAG
Kubespray, popular within the SIG-Cluster-Lifecycle of Kubernetes, is celebrated for deploying production-ready Kubernetes clusters, particularly on bare metal, which boosts performance for AI workloads like LLM and RAG. This session will explore using Kubespray in bare metal settings, addressing challenges, and sharing best practices.
The first part of the talk will show Kubespray's key features and provide practical tips. The latter half will focus on swiftly deploying AI using Retrieval-Augmented Generation (RAG), demonstrating how Kubespray facilitates setting up Kubernetes clusters on bare metal. This setup enhances AI applications by integrating continuous knowledge updates and domain-specific information via RAG, improving the accuracy and credibility of the AI systems.
The session will conclude with discussions on community engagement and future advancements, followed by a Q&A period to address participant queries.
How to deploy an AI-optimized k8s cluster with Kubespray
Kubespray is one of the most popular projects in the SIG-Cluster-Lifecycle community of Kubernetes, often used in a bare-metal environment. As AI workloads are rapidly increasing, bare metal can provide superior performance. Therefore, this session will share features and best practices of using Kubespray to build an AI-optimized cluster.
In the first half of the session, we will demo and discuss the main features of Kubespray, and we'll also share useful tips and best practices from Kubespray.
In the second half of the session, we will highlight enhanced features and share best practices to support AI workloads. This will include insights on GPU support, scheduler enhancement, batch job queuing, RDMA network, DRA driver, GPU monitoring, and more.
Lastly, we aim to delve deeper into community engagement and open a discussion about progressing the project further. We will then allocate a substantial amount of time for questions.
nerdctl: Docker-compatible CLI for containerd
During this session, participants will learn how nerdctl compares with Docker and Podman in terms of compatibility, along with features that Docker has not yet implemented. These include:
* Lazy-pulling with Stargz/Nydus/OverlayBD
* Peer-to-peer image distribution with IPFS
* Image encryption with OCIcrypt
* Image signing with Cosign
* Slirp-less rootless containers with bypass4netns
* Interactive Dockerfile debugging with buildg
Furthermore, the session will delve into nerdctl’s features, related projects (such as Lima, AWS Finch, Colima, Rancher Desktop, Kind ...), and the envisioned roadmap for its future development. Lastly, we will discuss community engagement and how to contribute to the project.
SIG Cluster Lifecycle: What's new in Kubespray
Kubespray is one of the most versatile Kubernetes cluster managers, and it benefits from an extremely active worldwide community, especially in Asia.
In the first half of the session, we will demo and discuss the most recent features, such as HA with kube-vip, the offline file management script for air-gapped environments, fast image mirrors, new OS support (Rocky, Kylin, OpenEuler Linux...), multi-arch clusters, support for Ansible collections, cluster hardening, and working with the operator and GitOps. We'll also share useful tips and best practices from Kubespray.
In the second half, we would like to take a deeper look at giving voice to the community and open a discussion about how to keep moving the project forward. We will then leave plenty of time for questions.
KubeCon + CloudNativeCon Europe 2026 Sessionize Event
KCD Hangzhou + OpenInfra Days China 2025 Sessionize Event
KubeCon + CloudNativeCon China 2025 Sessionize Event
KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 Sessionize Event
Maintainer Track + ContribFest: KubeCon + CloudNativeCon Europe 2024 Sessionize Event
KubeCon + CloudNativeCon + Open Source Summit China 2023 Sessionize Event