Judgement Day: Benchmarking "Black Box" LLMs with Open Legal Datasets

As proprietary models like GPT-5 and Gemini assert dominance in professional domains, the open source community faces a critical challenge: how do we verify their claims without access to their weights? We cannot inspect their code, but we can rigorously audit their reasoning using open source benchmarks.

In this session, 16-year-old researcher Kannan Murugapandian presents a technical evaluation of state-of-the-art LLMs using the LegalBench open dataset.

Moving beyond simple Q&A, this session explores:

1. The Evaluation Harness: A deep dive into the custom asynchronous Python testing pipeline designed to standardize prompts, manage vector retrieval, and score outputs across disparate model APIs.
2. Open vs. Closed: A data-driven comparison of how open-weight models (e.g., DeepSeek and Llama) stack up against closed giants when tasked with complex legal logic.
3. The "Persona" Myth: Quantitative results testing whether "lawyer personas" actually reduce hallucination rates or merely change the output tone.
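
For readers who want a concrete picture of the harness idea in item 1, the following is a minimal sketch, not the speaker's actual pipeline. It assumes each model can be wrapped as an async function from prompt to text, uses a shared prompt template with an optional persona line, and scores by exact match on a Yes/No label. The names `Example`, `build_prompt`, `evaluate`, `run_all`, the two stub models, and the sample clauses are all hypothetical, and vector retrieval is left out for brevity.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List

# Assumption: every model API is wrapped as an async callable (prompt -> text).
ModelFn = Callable[[str], Awaitable[str]]


@dataclass
class Example:
    text: str   # clause or question shown to the model
    label: str  # gold answer, "Yes" or "No"


PROMPT_TEMPLATE = "{persona}Clause: {text}\nAnswer Yes or No."


def build_prompt(example: Example, persona: str = "") -> str:
    """Render the same template for every model; the persona line is optional."""
    prefix = persona + "\n" if persona else ""
    return PROMPT_TEMPLATE.format(persona=prefix, text=example.text)


def normalize(output: str) -> str:
    """Reduce a raw completion to its first word, lowercased, without punctuation."""
    words = output.strip().split()
    return words[0].strip(".,:;!\"'").lower() if words else ""


async def evaluate(model: ModelFn, examples: List[Example],
                   persona: str = "", concurrency: int = 4) -> float:
    """Return exact-match accuracy of one model over the examples."""
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests per model

    async def run_one(ex: Example) -> bool:
        async with sem:
            raw = await model(build_prompt(ex, persona))
        return normalize(raw) == ex.label.lower()

    results = await asyncio.gather(*(run_one(ex) for ex in examples))
    return sum(results) / len(results) if results else 0.0


async def run_all(models: Dict[str, ModelFn], examples: List[Example],
                  persona: str = "") -> Dict[str, float]:
    """Evaluate all models concurrently and return accuracy per model name."""
    scores = await asyncio.gather(
        *(evaluate(fn, examples, persona) for fn in models.values())
    )
    return dict(zip(models.keys(), scores))


# Stub models standing in for real API clients (hypothetical behaviour).
async def stub_always_yes(prompt: str) -> str:
    await asyncio.sleep(0)
    return "Yes."


async def stub_keyword(prompt: str) -> str:
    await asyncio.sleep(0)
    return "Yes" if "indemnif" in prompt.lower() else "No"


# Made-up toy task: does the clause contain an indemnification obligation?
EXAMPLES = [
    Example("The supplier shall indemnify the buyer against losses.", "Yes"),
    Example("This agreement is governed by the laws of Singapore.", "No"),
    Example("Each party indemnifies the other for third-party claims.", "Yes"),
    Example("Notices must be sent in writing.", "No"),
]

if __name__ == "__main__":
    models = {"always_yes": stub_always_yes, "keyword": stub_keyword}
    print(asyncio.run(run_all(models, EXAMPLES)))
```

Running `run_all` a second time with a non-empty `persona` string gives a like-for-like comparison of the kind item 3 describes, since only the persona line of the prompt changes between runs.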

Kannan Murugapandian

Student, DPS International School

Singapore
