Salvatore Ricciardi

Almawave, AI Engineer

I'm a theoretical physicist specializing in biological complex systems. At Almawave, my work focuses on LLM evaluation, NLP metrics development, and the intersection of statistical physics and machine learning.

Beyond Benchmarks: A Modular Evaluation Framework for Large Language Models

**INTRODUCTION**
As LLMs move from research into production, the real challenge is no longer just training models—it is understanding whether they actually work. Evaluating models on your tasks, in your language, and on your data quickly becomes a critical engineering problem, especially as new versions are tested every day. Standard benchmarks rarely capture this complexity.

In this talk, we retrace our journey through the challenges of LLM evaluation, showing how the need to move away from a fragmented, manual process led us to design a more systematic approach. The result is a model-agnostic evaluation framework, built in Python, for defining, composing, and running custom metrics tailored to specific use cases.

**KEY TAKEAWAYS**
We will walk through the architecture and design principles behind the framework, showing how it combines:

- Traditional metrics to evaluate LLM outputs on standard tasks
- Innovative approaches like LLM-as-a-judge assessments for nuanced, criteria-driven evaluation
- Hybrid pipelines that compose these into a single score
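
The framework's actual API is not described in this abstract. As a minimal sketch of the idea in the list above, assume every metric is a callable that maps a prediction and a reference to a score in [0, 1]. All names here (`Metric`, `exact_match`, `token_f1`, `JudgeMetric`, `WeightedComposite`) are hypothetical, and the judge is a stub standing in for a real LLM call.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


class Metric(Protocol):
    """Hypothetical interface: (prediction, reference) -> score in [0, 1]."""

    def __call__(self, prediction: str, reference: str) -> float: ...


def exact_match(prediction: str, reference: str) -> float:
    """Traditional metric: 1.0 if the normalized strings are identical."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Traditional metric: F1 over whitespace-separated, lowercased tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


@dataclass
class JudgeMetric:
    """LLM-as-a-judge metric. `judge` is an assumed callable that sends a
    prompt to some model and returns a numeric score."""

    judge: Callable[[str], float]
    criterion: str

    def __call__(self, prediction: str, reference: str) -> float:
        prompt = (
            f"Criterion: {self.criterion}\n"
            f"Reference: {reference}\n"
            f"Answer: {prediction}\n"
            "Score from 0 to 1:"
        )
        # Clamp in case the judge returns a value outside [0, 1].
        return min(1.0, max(0.0, self.judge(prompt)))


@dataclass
class WeightedComposite:
    """Hybrid pipeline: weighted average of several metrics."""

    parts: Sequence[tuple[Metric, float]]

    def __call__(self, prediction: str, reference: str) -> float:
        total = sum(weight for _, weight in self.parts)
        weighted = sum(w * m(prediction, reference) for m, w in self.parts)
        return weighted / total


# Stub judge that always answers 0.5; a real one would call a model.
stub_judge = lambda prompt: 0.5

composite = WeightedComposite(
    [
        (exact_match, 1.0),
        (token_f1, 1.0),
        (JudgeMetric(judge=stub_judge, criterion="factual accuracy"), 2.0),
    ]
)
score = composite("The capital is Rome", "the capital is rome")
```

Because the composite is itself a callable with the same signature, it can be nested inside another composite, which is one plausible way to keep such a design modular.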

Through concrete examples drawn from real production use cases—including retrieval-augmented generation, multilingual customer support, information extraction, and function calling—we will demonstrate how evaluation can be made both rigorous and scalable. We will also discuss how async processing and pandas integration enable efficient evaluation across thousands of model outputs.
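
How the framework implements async processing is not specified here. One common pattern, sketched below under that caveat, is to score rows concurrently with `asyncio.gather` while a semaphore caps the number of in-flight calls. The function names, the `prediction`/`reference`/`score` keys, and the `max_concurrency` parameter are assumptions for illustration; the sleeping metric stands in for a slow remote judge call.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed shape of an async metric: (prediction, reference) -> awaitable score.
AsyncMetric = Callable[[str, str], Awaitable[float]]


async def evaluate_rows(
    rows: list[dict], metric: AsyncMetric, max_concurrency: int = 8
) -> list[dict]:
    """Score every row concurrently, with at most `max_concurrency` calls
    running at once. Output order matches input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_row(row: dict) -> dict:
        async with semaphore:
            value = await metric(row["prediction"], row["reference"])
        return {**row, "score": value}

    return await asyncio.gather(*(score_row(row) for row in rows))


async def slow_exact_match(prediction: str, reference: str) -> float:
    """Toy async metric; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return float(prediction.strip().lower() == reference.strip().lower())


rows = [
    {"prediction": "Rome", "reference": "rome"},
    {"prediction": "Milan", "reference": "Rome"},
    {"prediction": " Paris ", "reference": "Paris"},
]
results = asyncio.run(evaluate_rows(rows, slow_exact_match))
scores = [row["score"] for row in results]
```

The rows are plain dictionaries so the sketch runs with the standard library alone. With pandas, the same records can come from `DataFrame.to_dict("records")`, and the results can be turned back into a table with `pandas.DataFrame(results)`.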

Along the way, we will share practical lessons on:

- designing modular, extensible metric architectures,
- building evaluation datasets that reflect real-world complexity,
- avoiding the pitfalls of single-metric evaluation, and
- integrating custom benchmarks into model fine-tuning and deployment workflows.
