Approach and best practices for building a small language model from scratch for Data Engineering

As language models and AI agents become increasingly embedded in data engineering workflows, the need for lightweight, domain-adapted alternatives to large foundation models is growing. This session presents a practical, hands-on approach to building a small language model (SLM) optimized for data engineering use cases such as pipeline automation, metadata enrichment, SQL generation, and intelligent agent orchestration.
Attendees will learn best practices for model sizing, data preparation, and fine-tuning on custom datasets, with an emphasis on open-source libraries such as transformers, datasets, and peft. We will also explore how to build and integrate AI agents that use small models to perform modular, goal-directed tasks in data engineering pipelines. Finally, the session covers deployment on commodity hardware, using model optimization and quantization strategies to reduce memory footprint and improve inference speed.
This talk equips practitioners with actionable frameworks, agent design patterns, and operational strategies to create efficient, domain-aware language models and AI agents for real-world data engineering challenges.

Anandaganesh Balakrishnan

American Water, Principal Software Engineer

Philadelphia, Pennsylvania, United States

