Session

Modernise web scraping: use a self-hosted LLM to extract data from (almost) any website

Web scraping is complex and brittle. When the scraped website's theme or structure changes, the scraper must adapt, making it troublesome. Learn how to write web scrapers that use AI to skip manual matching and still get the desired data in the appropriate format.

In this (partly) hands-on session, you will learn how to integrate a self-hosted LLM (Llama) into your web scraping toolchain to modernise it. As a demo, You will see how the same code works for multiple types of websites. You will know how to get more done with less effort. You can host the LLM of your choice anywhere and use LiteLLM to make it easy to change. I will also share my experience with two side projects based on web scraping. We will review the pros and cons of both the procedural and the new AI-enabled methods. Come prepared to grasp a new way of web scraping to pluck desired data from (almost) any website.

Geshan Manandhar

Senior Software Engineer, Simply Wall St.

Sydney, Australia

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Jump to top