60% of OpenAI’s GPT-3.5 Outputs Contained Some Form of Plagiarism

Content creators, including authors, songwriters, and media outlets such as The New York Times, are taking legal action, claiming that generative AI trained on their copyrighted content can reproduce it verbatim without permission.

Before ChatGPT was introduced, Copyleaks, an artificial intelligence text analysis company, had already offered plagiarism detection services to companies and educational institutions for some time.

When ChatGPT first launched, it used the GPT-3.5 model; OpenAI has since upgraded to the more advanced GPT-4.

Plagiarism can manifest in various ways beyond just directly copying and pasting entire sentences and paragraphs.

Copyleaks aims to transform the subjective judgment of spotting plagiarism into a precise and scientific process.
The company employs a unique scoring system that combines measures of identical text, minor modifications, paraphrased content, and other elements to generate a “similarity score” for each piece of content.
According to the report, a score of 0% indicates that all the content is original, while a score of 100% signifies that none of it is.
For GPT-3.5, approximately 45.7% of outputs featured identical text, 27.4% included minor alterations, and 46.5% contained paraphrased content.
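To make the 0%-to-100% scale concrete, here is a minimal Python sketch of one way an "identical text" measure could be computed. It is not Copyleaks' method, which the report does not disclose. The function names, the three-word window, and the idea of counting words covered by shared word sequences are all assumptions made for illustration, and the sketch ignores minor alterations and paraphrasing entirely.

```python
def ngrams(words, n):
    """Return the set of consecutive n-word sequences in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity_score(candidate: str, source: str, n: int = 3) -> float:
    """Hypothetical score: percentage of candidate words that fall inside
    an n-word sequence also found in the source (0 = all original)."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    source_grams = ngrams(source.lower().split(), n)
    covered = [False] * len(cand)
    for i in range(len(cand) - n + 1):
        # Mark every word of a sequence that also appears in the source.
        if tuple(cand[i:i + n]) in source_grams:
            for j in range(i, i + n):
                covered[j] = True
    return 100.0 * sum(covered) / len(cand)


if __name__ == "__main__":
    source = "the cat sat on the mat"
    print(similarity_score("the cat sat on the mat", source))
    print(similarity_score("the cat sat down quietly today", source))
    print(similarity_score("a completely different sentence here", source))
```

Under these assumptions, an exact copy scores 100, a text sharing only its first three words with the source scores 50, and a text with no shared three-word sequence scores 0. A production system would also need to handle punctuation, reordering, and paraphrase, which this sketch does not attempt.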

Copyleaks requested approximately a thousand outputs from GPT-3.5, each consisting of about 400 words, covering 26 different subjects.

Among the GPT-3.5 outputs analyzed, the one with the highest similarity score was in computer science (100%), with physics (92%) and psychology (88%) following closely behind.

The subjects with the lowest similarity scores were theater (0.9%), humanities (2.8%), and English language (5.4%).

OpenAI spokesperson Lindsey Held stated in a communication to Axios: “Our models were created and trained to understand concepts to aid in problem-solving. We have implemented safeguards to prevent unintentional memorization, and our terms of service forbid the deliberate use of our models to reproduce content.”

In the legal case filed by The New York Times against Microsoft and OpenAI, it is alleged that the AI systems’ extensive replication of content amounts to copyright infringement.

In response to the lawsuit, OpenAI contended that “regurgitation” is an uncommon issue and accused The New York Times of manipulating prompts.