
Benchmarking LLMs on Vulnerability Prioritization

We present the first large-scale benchmark of leading LLMs (GPT-4o mini, Claude 3.7, Gemini 2.5) against the Exploit Prediction Scoring System (EPSS) on the vulnerability prioritization task, using 50,000 CVEs stratified by real-world exploitation. Our results show that LLMs produce lumpy, poorly calibrated probability estimates, fail to maintain efficiency and coverage beyond 15%, and incur prohibitive inference costs at operational scale. In contrast, predictive models such as EPSS and our Global Model deliver higher accuracy, better coverage, and practical cost profiles. We release our full dataset, agent (JayPT), and methodology under an MIT license to enable reproducibility and further research on scalable, evidence-driven vulnerability triage.
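
Coverage and efficiency here are the standard remediation metrics from the EPSS literature: coverage is the share of actually exploited CVEs that a strategy prioritizes (recall), and efficiency is the share of prioritized CVEs that turn out to be exploited (precision). The sketch below is a minimal illustration of how these metrics can be computed for a scored CVE list; the function name, inputs, and toy data are our own assumptions for exposition, not the benchmark's released code.

```python
# Minimal sketch: coverage/efficiency for a scored set of CVEs.
# `scores` and `exploited` are hypothetical inputs: per-CVE probability
# estimates from some model, and ground-truth exploitation labels.

def coverage_efficiency(scores, exploited, threshold):
    """Treat every CVE scored at or above `threshold` as prioritized.

    coverage   = share of truly exploited CVEs that were prioritized (recall)
    efficiency = share of prioritized CVEs that were truly exploited (precision)
    """
    prioritized = [s >= threshold for s in scores]
    tp = sum(p and e for p, e in zip(prioritized, exploited))
    fn = sum((not p) and e for p, e in zip(prioritized, exploited))
    fp = sum(p and (not e) for p, e in zip(prioritized, exploited))
    coverage = tp / (tp + fn) if (tp + fn) else 0.0
    efficiency = tp / (tp + fp) if (tp + fp) else 0.0
    return coverage, efficiency

if __name__ == "__main__":
    # Toy data only; sweeping the threshold traces the coverage/efficiency
    # trade-off curve of the kind used to compare prioritization strategies.
    scores = [0.91, 0.40, 0.05, 0.75, 0.02]
    exploited = [True, False, False, True, False]
    for t in (0.1, 0.5, 0.9):
        cov, eff = coverage_efficiency(scores, exploited, t)
        print(f"threshold={t:.1f}  coverage={cov:.2f}  efficiency={eff:.2f}")
```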


Michael Roytman

CTO at Empirical Security

Chicago, Illinois, United States
