OpenAI has been accused by many parties of training its AI on copyrighted content without authorization. Now, a new paper by an AI watchdog organization makes the serious accusation that the company has increasingly relied on nonpublic books it did not license to train its more sophisticated AI models.
AI models are essentially complex prediction engines. Trained on vast amounts of data (books, movies, TV shows, and so on), they learn patterns and novel ways to extrapolate from a simple prompt. When a model "writes" an essay on a Greek tragedy or "draws" Ghibli-style images, it is simply drawing on its broad knowledge to approximate; it isn't arriving at anything new.
While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have avoided real-world data entirely. That's likely because training on purely synthetic data carries risks, such as worsening a model's performance.
The new paper, from the AI Disclosure Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)
In ChatGPT, GPT-4o is the default model. O'Reilly does not have a licensing agreement with OpenAI, the paper says.
"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content … compared to OpenAI's earlier model GPT-3.5 Turbo," wrote the paper's co-authors. "Conversely, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."
The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data. Also known as a "membership inference attack," the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, it suggests that the model might have prior knowledge of the text from its training data.
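To make the idea concrete, here is a minimal sketch of a DE-COP-style multiple-choice probe. This is not the authors' implementation: `decop_probe` and `model_choose` are hypothetical names, and the model call is stubbed out. The sketch only illustrates the core logic of shuffling a verbatim passage among paraphrases and measuring how often the model picks the verbatim one; accuracy well above chance would hint the passage was in the training data.

```python
import random

def decop_probe(model_choose, original, paraphrases, trials=20, seed=0):
    """DE-COP-style membership probe (illustrative sketch).

    Repeatedly shows the model a shuffled multiple-choice list containing
    the verbatim passage plus AI-generated paraphrases, and records how
    often the model identifies the verbatim one. `model_choose(options)`
    is a hypothetical callable standing in for a real model API; it
    returns the index of the option the model picks as human-authored.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        options = [original] + list(paraphrases)
        rng.shuffle(options)  # randomize position so order gives no hint
        if options[model_choose(options)] == original:
            hits += 1
    chance = 1.0 / (1 + len(paraphrases))  # accuracy of random guessing
    return hits / trials, chance

# Toy stand-in for a model that always "recognizes" the original text.
demo_accuracy, chance_rate = decop_probe(
    lambda opts: opts.index("verbatim passage"),
    "verbatim passage",
    ["paraphrase one", "paraphrase two", "paraphrase three"],
)
print(demo_accuracy, chance_rate)
```

With four options, random guessing would score 0.25; a real study would compare the measured accuracy against that chance rate across many passages.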
The paper's co-authors, O'Reilly, Strauss, and AI researcher Sruly Rosenblat, say they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models for knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training set.
According to the paper's results, GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. That holds even after accounting for potential confounding factors, the authors said, such as improvements in newer models' ability to determine whether a text was authored by a human.
"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors.
It isn't a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method is not infallible, and that OpenAI might have collected the paywalled book excerpts from users copying and pasting them into ChatGPT.
Muddying the waters further, the co-authors did not evaluate OpenAI's most recent collection of models, which includes GPT-4.5 and "reasoning" models such as o3-mini and o1. It's possible these models were not trained on paywalled O'Reilly book data, or were trained on a smaller amount of it than GPT-4o.
That being said, it's no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs. It's part of a broader industry trend: AI companies recruiting experts in fields such as science and physics to effectively have them feed their knowledge into AI systems.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that allow copyright holders to flag content they would prefer the company not use for training purposes.
Still, as OpenAI battles several lawsuits over its training data practices and its treatment of copyright law in U.S. courts, the O'Reilly paper is not the most flattering look.
OpenAI did not respond to a request for comment.