OpenAI Claims Generative AI Hinges on Copyrighted Content Use: Publishing Pulse

2024-01-11T14:16:26

Welcome to Publishing Pulse, your weekly source for industry updates in online publishing. Stay informed about the latest trends and breakthroughs in the ad ecosystem, content creation, SEO, AI technology, and monetization.

If you prefer to listen to industry news, you can tune in to The Publisher Lab podcast. New episodes are released every Thursday.

The Rise of Google’s Search Generative Experience

Google’s Search Generative Experience (SGE) now appears for a substantial 84% of search queries. The Opt-in (68%) and Collapsed (16%) formats make up that share, while the remaining format, None (15%), covers queries with no SGE result. This shift in the search experience is particularly notable in YMYL (Your Money or Your Life) queries.

Among the most prevalent SGE content formats are unordered lists, found in 48% of cases, often accompanied by brief descriptions. Another variant, constituting 26% of SGE instances, offers more detailed “breakouts” with expanded information and interactive elements.

Google SGE "here are some products to consider" https://t.co/vf2zgsy97W via @b4k_khushal pic.twitter.com/0hnBVAH5f3
— Barry Schwartz (@rustybrick) January 11, 2024

Google is also experimenting with a variety of product display formats in SGE, enhancing the search experience for e-commerce and retail-related queries. These formats include Product Listings with Sourced Descriptions, Carousel Groupings, and Valuecards for specific products. Additionally, Google has introduced warnings in SGE answers for categories like Age, Financial, Medical, Legal, and Dangerous, emphasizing its commitment to providing contextually accurate and safe information to users. These developments reflect Google’s continuous efforts to refine and personalize the search process in line with user needs and preferences.

SGE Results Don’t Match Organic Search Results

An in-depth analysis by Authoritas of Google’s Search Generative Experience (SGE) reveals a significant shift in how search results are presented and how that shift may affect website traffic. The study, conducted in December 2023 and covering 1,000 commercial keywords across diverse categories, found that SGE answers seldom correspond with the top 10 Google organic search results.

In fact, 93.8% of the time, the URLs provided by SGE are entirely different from those in the organic listings. This divergence suggests that searchers obtaining their answers directly from Google’s AI-generated responses might bypass visiting actual websites, potentially reducing organic traffic. Despite this, websites not ranking in the top 10 organic results still have a chance to be featured in SGE links.

The overlap between SGE links and organic search results is minimal: only about 4.5% of SGE links exactly match a URL in the organic results, and a mere 1.6% match at the domain level. On average, SGE displays roughly 10 links, but typically only four are unique, often sourced from different websites. SGE is also prevalent, appearing for 86.8% of the analyzed keywords, and it most often shows the “Generate” button version rather than the “Show more” button. This emerging trend points to a new landscape in which Google’s AI-driven answers could reshape web traffic patterns and the visibility of websites in search results.
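
Publishers who want to run a similar check on their own keywords can sort each SGE link into one of the three buckets described above: exact URL match, domain-level match, or no match. The following is a minimal Python sketch, not Authoritas’ methodology, which this summary does not describe. The function names, the "www." normalization, and the example URLs are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def domain(url: str) -> str:
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_sge_links(sge_links, organic_links):
    """Count SGE links that match an organic result exactly, by domain only, or not at all."""
    organic_urls = set(organic_links)
    organic_domains = {domain(u) for u in organic_links}
    counts = {"exact": 0, "domain_only": 0, "no_match": 0}
    for url in sge_links:
        if url in organic_urls:
            counts["exact"] += 1
        elif domain(url) in organic_domains:
            counts["domain_only"] += 1
        else:
            counts["no_match"] += 1
    return counts


def shares(counts):
    """Convert bucket counts to percentages of all SGE links, rounded to one decimal."""
    total = sum(counts.values())
    return {k: (round(100 * v / total, 1) if total else 0.0) for k, v in counts.items()}


# Hypothetical data for a single keyword (placeholder URLs, not from the study).
sge = [
    "https://example.com/a",
    "https://www.example.org/b",
    "https://example.net/c",
    "https://example.net/d",
]
organic = ["https://example.com/a", "https://example.org/other"]

print(shares(classify_sge_links(sge, organic)))
# {'exact': 25.0, 'domain_only': 25.0, 'no_match': 50.0}
```

In practice, the counts would be summed across every tracked keyword before converting to percentages, so that keywords with more SGE links are not underweighted.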

The New York Times Sues OpenAI, Microsoft For Copyright Infringement

The New York Times has initiated a significant legal battle against tech giants OpenAI and Microsoft, alleging copyright infringement through the use of the Times’ content to train their generative AI models. In a landmark lawsuit, The Times is demanding the destruction of all models and training data derived from its content, alongside seeking billions of dollars in damages.

The core of the lawsuit lies in the accusation that Microsoft’s Copilot and OpenAI’s ChatGPT have been using Times’ articles without authorization, leading to the dissemination of inaccurate information under the Times’ name. The lawsuit also claims that these AI models are not just using the Times’ content unlawfully but are also competing with news publishers, which could significantly harm The Times’ business operations.

There is a particular concern about the impact of these practices on the news subscription business and overall web traffic for publishers, a sentiment echoed in similar legal actions against other tech companies.

The situation is complex, particularly as generative AI models can unintentionally echo the content they were trained on, suggesting that OpenAI and Microsoft are profiting from the Times’ journalism without offering due compensation.

This legal action follows a pattern where publishers are increasingly challenging tech companies that they perceive as encroaching on their content and revenue. Not all publishers are opting for the legal route, however; some have entered into licensing agreements with AI vendors.

The outcome of this lawsuit could hinge on various factors, including whether the AI’s output is a result of direct user manipulation to reproduce copyrighted material, a defense that legal experts believe could weaken the Times’ case.

OpenAI Negotiates With Publishers To License Content

OpenAI is currently engaged in strategic negotiations with several publishers to acquire licenses for using their articles in the training of AI models. This initiative has already borne fruit, as OpenAI has successfully secured licensing agreements with prominent companies such as Axel Springer SE and The Associated Press. These agreements grant OpenAI legitimate access to a wealth of content, which is crucial for the development and refinement of its AI models.

“The AI models that we have today are not like teaching a child; it’s more like feeding them to a plagiarism machine,” says Tyler Bishop on this week’s episode of The Publisher Lab podcast.

The future of these negotiations and the broader relationship (and growing tension) between AI technology companies and content publishers hinges on the legal interpretation of AI’s role in using copyrighted materials for model training.

OpenAI Says Generative AI Is Impossible Without Copyrighted Material

In a recent submission to the House of Lords communications and digital select committee, OpenAI, known for developing ChatGPT and GPT-4, emphasized the critical need for accessing copyrighted material to create sophisticated AI tools.

The company highlighted that copyright law encompasses a broad spectrum of human expression, which is indispensable for training AI models effectively. OpenAI pointed out that limiting these models to public domain content, which is significantly outdated, would not yield AI systems capable of addressing contemporary societal needs. This statement comes against the backdrop of a lawsuit filed by The New York Times (NYT) against OpenAI and Microsoft, accusing them of illegally using NYT’s content to develop their AI products.

OpenAI, along with other AI developers, often leans on the legal doctrine of “fair use” to justify their use of copyrighted material in AI model training. The company argues that the current copyright laws do not specifically prohibit the use of such content for developing AI models. Furthermore, OpenAI has expressed willingness to engage in independent analysis and safety testing of its AI systems.

This commitment is part of an agreement reached at a global safety summit, where OpenAI agreed to collaborate with governments on safety testing for its most powerful models. However, the lawsuit from NYT is not an isolated case; OpenAI faces several legal challenges, including those from authors and Getty Images, over alleged copyright violations. Similarly, Anthropic, backed by Amazon and responsible for the Claude chatbot, is also under legal scrutiny from music publishers like Universal Music for purportedly misusing copyrighted song lyrics in training its AI model.

Sarah is a social media expert and successful brand marketer. She has experience growing brands and content across multiple different platforms and is always on the cutting edge of emerging social platform and internet culture trends.