How to scrape modern websites to feed AI agents

It was web scraping that quietly enabled the AI revolution.

All major generative AI models and large language models (LLMs) have been trained on massive datasets extracted from the web. But the story doesn’t end there. Today’s AI agents still rely on up-to-date information from online sources to stay relevant, reduce hallucinations, and go beyond their training data. Once again, the web becomes the richest source of context.

Yet scraping the modern web is far from trivial. From anti-bot protections to dynamic, client-side rendering and long-term maintenance issues, building scalable data pipelines has become increasingly complex.

In this talk, we’ll demonstrate how to:

- Set up resilient, maintainable data extraction workflows using cloud-based web scraping Actors,
- Clean messy HTML and structure it into meaningful content,
- Feed that data into a vector database to power a Retrieval-Augmented Generation (RAG) pipeline,
- And integrate scraping tools directly into AI agents using the Model Context Protocol (MCP).
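
As a rough illustration of the middle two steps (cleaning HTML and retrieving relevant chunks), here is a minimal, standard-library-only Python sketch. It is not the tooling used in the talk: `TextExtractor`, `clean_html`, `chunk_words`, `embed`, and `retrieve` are hypothetical names, and the bag-of-words `embed` function is a toy stand-in for a real embedding model and vector database.

```python
from collections import Counter
from html.parser import HTMLParser
import math
import re


class TextExtractor(HTMLParser):
    """Collects visible text, skipping elements that rarely hold content."""

    SKIP = {"script", "style", "nav", "footer", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts))


def chunk_words(text, size=50):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query (the 'R' in RAG)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


if __name__ == "__main__":
    page = (
        "<html><head><style>p{}</style></head><body><nav>Home</nav>"
        "<p>Apples are red.</p><script>x()</script>"
        "<p>Bananas are yellow.</p></body></html>"
    )
    chunks = chunk_words(clean_html(page), size=3)
    print(retrieve("yellow bananas", chunks))
```

In a real pipeline, the retrieved chunks would be passed to the LLM as context, and `embed` plus the sorted scan would be replaced by an embedding model and a vector database query.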

Whether you're building copilots, smart assistants, or knowledge-augmented LLM apps, this talk will equip you with the techniques and tools to make the web your AI's best friend.

Jan Curn

Founder & CEO of Apify

Prague, Czechia
