Session

LLM-generated code and open source license compliance: how big is the problem?

Recent research has raised concerns that LLM-generated code can be significantly similar to the models' training data, which creates potential legal issues when that code carries incompatible software licenses. Xu et al. evaluated this phenomenon with their LiCoEVAL benchmark and showed that a small but significant portion of LLM outputs contained code "notably similar" to existing open-source implementations. However, these findings were limited by the scope of their reference dataset.

This presentation explores new research that expands on these initial findings by using STF's osskb.org service, whose data set is 35 times larger than the one used in the original study, together with the SCANOSS open source scanner, scanoss.py. Using the Winnowing algorithm, the speakers' analysis revealed similarity rates significantly higher than previously reported.
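For readers unfamiliar with Winnowing, the following is a minimal, generic sketch of the idea: hash every k-gram of the normalized text, keep the minimum hash in each sliding window as a fingerprint, and compare fingerprint sets. It is not the scanoss.py implementation; the function names, the CRC32 hash, the normalization rule, and the `k` and `window` values are all illustrative assumptions.

```python
import zlib


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-gram of the normalized text (assumed normalization:
    keep only alphanumeric characters, lowercased)."""
    data = "".join(ch.lower() for ch in text if ch.isalnum()).encode("utf-8")
    return [zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)]


def winnow(text: str, k: int = 5, window: int = 4) -> set[int]:
    """Simplified winnowing: keep the minimum hash of each sliding window.
    Positions are not recorded; only the set of selected hashes is returned."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two fingerprint sets (an assumed metric)."""
    fa, fb = winnow(a), winnow(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


generated = "for (int i = 0; i < n; i++) { sum += values[i]; }"
reference = "FOR (int i=0; i<n; i++) {sum += values[i];}"
print(similarity(generated, reference))
```

Because the sketch normalizes case and whitespace before hashing, the two snippets above produce the same fingerprints. A production scanner would typically also record where each fingerprint occurs so that matches can be traced back to specific lines.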

During the talk, the speakers will demonstrate how this expanded reference base affects detection rates, validate the effectiveness of the Winnowing algorithm as a preliminary indicator of code similarity, and pose open questions to prompt a discussion about the implications of using AI coding assistants.

Link to the study: https://shorter.me/_XHcS

Agustin Benito Bethencourt

Independent Consultant

Los Llanos de Aridane, Spain
