Varun Joshi
Senior Data Engineer at AWS
Seattle, Washington, United States
Highly motivated and results-oriented Data Engineer with 12+ years of experience designing, building, and optimizing scalable data pipelines and architectures. Proven expertise in data warehousing, ETL/ELT processes, and cloud platforms. Passionate about leveraging Artificial Intelligence (AI) and Machine Learning (ML).
Designed and deployed AI-driven data solutions, integrating LLM-powered coding assistants into data engineering workflows to deliver AI solutions for customers. Focused on leveraging LLMs and advanced engineering to build scalable, secure, and trustworthy platforms, resulting in significant efficiency gains, reduced on-call burden, and improved customer trust.
Driving AI adoption across teams to enhance productivity, streamline deployments, and improve end-user experience.
Topics
The Intent-Driven Data Architect: Using LLMs to Generate Type-Safe Synthetic Test Beds
In the era of "Intent-Driven" data platforms, we no longer just build pipelines; we build systems that respond to natural language queries. But how do we test these complex, cognitive systems without compromising sensitive production data? Enter the LLM Data Architect. This session explores a modern Pythonic workflow for generating high-fidelity, schema-validated JSON datasets on demand.
I will demonstrate how to use Pydantic to define a rigorous "data contract" and leverage Instructor or Outlines to force LLMs into producing perfectly structured, type-safe synthetic records. Attendees will walk away with a blueprint for a self-validating data generator that handles complex business logic and "long-tail" edge cases, ensuring your cloud-native apps are robust before the first user ever signs in.
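As a small taste of that workflow, here is a minimal sketch of the "data contract" idea using Pydantic. The `fake_llm_response` stub is purely illustrative: it stands in for a real Instructor- or Outlines-constrained LLM call, and the `Customer` fields are invented for the example.

```python
import json
from pydantic import BaseModel, ValidationError

class Customer(BaseModel):
    """The data contract every synthetic record must satisfy."""
    id: int
    name: str
    plan: str

def fake_llm_response() -> str:
    # Stand-in for an Instructor/Outlines-constrained LLM call.
    return '{"id": 1, "name": "Ada", "plan": "pro"}'

def generate_record() -> Customer:
    raw = fake_llm_response()
    try:
        # Validate the model's output against the contract.
        return Customer(**json.loads(raw))
    except ValidationError as exc:
        # In a real loop you would re-prompt the model with the error message.
        raise RuntimeError(f"LLM output violated the contract: {exc}")

record = generate_record()
print(record.id, record.name, record.plan)
```

The key design choice is that the Pydantic model, not the prompt, is the source of truth: any output that fails validation is rejected and can be fed back to the model for a retry.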
SLA-Driven Data Engineering: How to Stop Shipping Pipelines Without Contracts
Most data pipelines ship without a contract. No defined freshness guarantee. No agreed row count tolerance. No documented owner. No stated downstream dependency. And when something breaks, everyone finds out at the same time — when the business report is wrong.
This session makes the case for treating pipeline SLAs the way backend engineers treat API contracts: as a first-class engineering artifact that gets defined before the pipeline ships, not retrofitted after the first incident. We'll walk through what a meaningful data SLA actually contains — freshness windows, volume tolerances, quality thresholds, blast radius documentation, and escalation paths — and how to operationalize them across a team without turning every pipeline into a bureaucratic exercise.
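To make the "SLA as engineering artifact" idea concrete, here is a toy sketch of a pipeline SLA expressed in code rather than in a wiki page. All field names, values, and the freshness check are illustrative assumptions, not a prescribed format from the session.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class PipelineSLA:
    """An illustrative data SLA, defined before the pipeline ships."""
    owner: str
    freshness: timedelta          # max acceptable age of the newest load
    row_count_tolerance: float    # allowed relative deviation from baseline
    downstream: list = field(default_factory=list)  # who breaks if this breaks

def check_freshness(sla: PipelineSLA, last_loaded: datetime, now: datetime) -> bool:
    """Return True if the pipeline is within its freshness window."""
    return (now - last_loaded) <= sla.freshness

sla = PipelineSLA(
    owner="data-platform@example.com",
    freshness=timedelta(hours=6),
    row_count_tolerance=0.05,
    downstream=["finance_daily_report"],
)
now = datetime(2026, 1, 1, 12, 0)
print(check_freshness(sla, datetime(2026, 1, 1, 9, 0), now))   # within the 6h window
print(check_freshness(sla, datetime(2026, 1, 1, 2, 0), now))   # stale: SLA breach
```

Because the SLA is a typed object, it can be version-controlled next to the pipeline code and checked automatically by a scheduler or monitoring job.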
Stop Scraping, Start Generating: Why Synthetic Data is the Startup Superpower
For most early-stage startups, data is the "chicken and egg" problem. You need data to build a product, but you need a product to get data. Historically, the solution was scraping: building fragile, legally dubious web crawlers that break the moment a website changes its CSS. In 2026, there is a better way.
This talk introduces Synthetic Data Generation (SDG) as a superior alternative to traditional scraping for startups. We will compare the high technical debt of maintenance-heavy scrapers against the scalability of generative models. You’ll learn how to go from "zero data" to a production-ready test suite or training set in hours rather than months.
We will cover:
The Hidden Cost of Scraping: Why maintenance, cleaning, and legal compliance (GDPR/EU AI Act) are startup killers.
The "Cold Start" Solution: Using Python to generate balanced, diverse datasets before your first user ever signs up.
A Startup Toolkit: A walkthrough of open-source Python libraries (like SDV, Faker, and Gretel-python) that allow you to "architect" your data instead of "hunting" for it.
Real-world Case Study: How to build a synthetic "Customer Feedback" loop to test your NLP models without a single real customer.
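To illustrate the "architect, don't hunt" idea, here is a stdlib-only sketch of a synthetic customer-feedback generator. The templates, features, and sentiment labels are invented for the example; in the talk's toolkit, libraries like Faker or SDV would replace these hand-rolled templates with richer generators.

```python
import random

# Hand-rolled stand-in for Faker/SDV: templates and labels are illustrative.
TEMPLATES = {
    "positive": ["Love the {feature}, works great.", "The {feature} saved me hours."],
    "negative": ["The {feature} keeps crashing.", "Billing for the {feature} is confusing."],
}
FEATURES = ["dashboard", "export tool", "mobile app"]

def synthetic_feedback(n: int, seed: int = 42) -> list:
    """Generate n labeled feedback records, deterministically for repeatable tests."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        sentiment = rng.choice(sorted(TEMPLATES))
        text = rng.choice(TEMPLATES[sentiment]).format(feature=rng.choice(FEATURES))
        records.append({"id": i, "sentiment": sentiment, "text": text})
    return records

for row in synthetic_feedback(3):
    print(row)
```

Seeding the generator is the important detail: it turns the synthetic dataset into a reproducible fixture, so an NLP model can be tested against the exact same "customers" on every CI run.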
Generative AI Meets the Enterprise Data Stack
Large Language Models and Generative AI are no longer just research curiosities — organizations are now actively integrating them into their data platforms to unlock new forms of intelligence. This session examines how modern enterprises are embedding LLMs into their data stacks: from natural language interfaces for querying data warehouses, to AI-assisted data quality and documentation, to Retrieval-Augmented Generation (RAG) architectures that keep AI grounded in your own data. We'll cover practical patterns, pitfalls, and a live architecture walkthrough.
From Prototype to Production: Deploying ML Models at Scale
Modern businesses need data that is fresh, reliable, and ready to act on — yet most organizations still struggle with batch pipelines that deliver stale insights. This session dives into the architectural decisions behind building production-grade, real-time data pipelines using open-source tools and cloud-native patterns. Through practical demonstrations, attendees will learn how to move beyond daily ETL jobs toward streaming architectures that power live dashboards, real-time fraud detection, and instant personalization.
RAG for Data Engineers: What It Is, Where It Fits, and Why Your Metadata Is the Missing Piece
Large language models are only as useful as the context you give them. For data engineers, that context is your schema, your lineage graph, your query history, your dbt models — and most teams haven't connected those dots yet.
This session is a practical introduction to Retrieval-Augmented Generation from a data engineering lens. We'll cover what RAG actually does under the hood, why it matters more for data teams than most, and the four places it shows up naturally in the data engineering workflow: answering schema questions without digging through a stale catalog, grounding SQL agents in your actual table definitions, giving incident response agents access to historical pipeline context, and surfacing institutional knowledge that currently lives only in senior engineers' heads.
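To show what "retrieval" means at its simplest, here is a stdlib-only sketch of retrieving the most relevant table description for a question. The table names and descriptions are invented; a real system would pull them from your catalog, lineage graph, or dbt manifest, and would use embeddings rather than this toy bag-of-words similarity.

```python
import math
import re
from collections import Counter

# Illustrative metadata corpus standing in for a real data catalog.
DOCS = {
    "orders": "orders table: order_id, customer_id, total_amount, created_at",
    "customers": "customers table: customer_id, email, signup_date, region",
    "events": "events table: raw clickstream events, event_type, session_id",
}

def _vec(text: str) -> Counter:
    """Tokenize into a bag-of-words vector (toy substitute for embeddings)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, k: int = 1) -> list:
    """Return the k table names whose descriptions best match the question."""
    q = _vec(question)
    ranked = sorted(DOCS, key=lambda name: _cosine(q, _vec(DOCS[name])), reverse=True)
    return ranked[:k]

print(retrieve("which table has customer email and region"))
```

The retrieved descriptions would then be pasted into the LLM's context, which is the whole trick: the model answers from your actual schema instead of guessing.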
LLMs as Data Architects
Testing data-intensive applications often feels like a choice between two evils: using risky real-world data or spending hours writing brittle scripts for "dummy" data that lacks realism. But what if we could use Large Language Models (LLMs) not just to chat, but to architect complex, structured, and schema-validated datasets on demand?
In this talk, we explore how to turn LLMs into reliable data architects. We will move beyond simple prompting and dive into Type-Safe Generation using Python. You will learn how to use Pydantic to define your "data contract" and leverage libraries like Instructor or Outlines to force LLMs to output perfectly formatted JSON that matches your application’s requirements every single time.
Key takeaways include:
Why "Prompting for JSON" usually fails and how to fix it with JSON Schema.
Using Pydantic to define complex, nested data structures for testing.
A comparison of OpenAI’s Structured Outputs vs. Local LLM constraints (using Outlines).
Real-world patterns for generating "Long-Tail" edge cases that traditional mock libraries miss.
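The first takeaway can be demonstrated in a few lines. The two strings below are invented examples, not real model output: one mimics the "chatty" reply you typically get when you merely ask for JSON in the prompt, the other mimics what a schema-constrained decoder (Structured Outputs or Outlines) guarantees.

```python
import json

# Typical reply when you only *ask* for JSON in the prompt:
chatty = 'Sure! Here is your JSON:\n{"id": 1, "name": "Ada"}\nLet me know if you need more.'
# Reply from a schema-constrained decoder, which can only emit valid JSON:
constrained = '{"id": 1, "name": "Ada"}'

def try_parse(raw: str):
    """Parse JSON, returning None on failure instead of raising."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

print(try_parse(chatty))       # None: the surrounding prose breaks naive parsing
print(try_parse(constrained))  # {'id': 1, 'name': 'Ada'}
```

This is the failure mode JSON Schema-constrained generation eliminates: instead of regex-scraping JSON out of prose, the decoder is prevented from producing anything but a valid instance of your schema.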
Agentic Loops in the Data Stack: From Pipeline Failure to Auto-Remediation
Every data engineer knows the 2 AM pipeline failure — the one nobody notices until Friday's report is wrong. In this session, we break down five AI agents that are changing how data teams operate: from monitoring pipelines 24/7 and catching schema drift at ingestion, to closing the gap between a production failure and its root cause in minutes. We'll walk through real implementation patterns, including a baseline-learning monitoring agent and a tool-use driven incident response loop, and discuss what the shift to agentic data engineering actually means for the way teams are built and how engineers grow. Whether you're evaluating agents for your platform or already running them in production, you'll leave with concrete patterns you can apply immediately.
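As a flavor of the baseline-learning idea, here is a toy sketch of an agent's anomaly check over daily row counts. The history values and the z-score threshold are illustrative assumptions; a production agent would learn richer baselines (seasonality, per-partition volumes) than a single mean and standard deviation.

```python
import statistics

def learn_baseline(history):
    """Learn a simple baseline (mean, stdev) from historical daily row counts."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomalous(count, baseline, z_threshold=3.0):
    """Flag a row count that deviates from the baseline by more than z_threshold sigmas."""
    mean, std = baseline
    if std == 0:
        return count != mean
    return abs(count - mean) / std > z_threshold

history = [10_000, 10_250, 9_900, 10_100, 10_050]  # illustrative daily row counts
baseline = learn_baseline(history)
print(is_anomalous(10_120, baseline))  # a normal day
print(is_anomalous(2_000, baseline))   # pipeline likely dropped data
```

In the agentic loop, a True result is what triggers the tool-use chain: pull recent run logs, diff the source schema, and draft an incident summary before anyone is paged.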