- Reddit may block search crawlers if it can’t reach deals with generative AI companies to pay for its data.
- 535 news organizations have already installed blockers.
- Without data, genAI’s output will suffer – so will companies pay?
AI is going to take over the journalistic world, leaving writers to retrain or become destitute (AKA work in hospitality). But first, generative AI needs to do some reading.
Possibly the most vital resource in the age of generative AI is digital news stories. For years, companies like OpenAI have used news stories to build the data sets that teach machines how to recognize and fluidly respond to human queries.
Then, in August, roadblocks started making this training more difficult. Since the end of summer, at least 535 news organizations, including the New York Times, Reuters and the Washington Post, have installed blockers.
The blockers prevent content on the news websites from being crawled, collected and used in AI training corpora.
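The article doesn't say how these blockers are implemented. One common mechanism is a robots.txt file that names specific crawlers and disallows them. The sketch below assumes that approach. The rules shown are an illustrative example, not any publisher's actual file, and `news.example.com` and `is_allowed` are hypothetical names. `GPTBot` and `CCBot` are the user-agent tokens published by OpenAI and Common Crawl respectively.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not a real publisher's file):
# two AI-related crawlers are shut out, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://news.example.com/story"  # hypothetical URL
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        print(agent, is_allowed(agent, story))
```

Note that robots.txt is a request, not an enforcement mechanism: it only stops crawlers that choose to honor it.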
Some reports reckon it’s the publishers’ way of ensuring their share in the huge generative AI market that’s projected to reach $1.3 trillion by 2032. Alternatively, the question of copyright and intellectual property may simply be too murky for anything other than blanket bans.
Discussions are now focused on paying publishers so that ChatGPT can surface links to individual news stories in its responses. This would benefit newspapers by providing them with direct payment, obviously, and by increasing traffic to their websites. It would also solve the issue of crediting journalists’ work.
Cash flooded into generative AI in the first three quarters of 2023, to the sum of almost $16 billion in venture capital. The figure reflects, in part, how expensive the technology is to build and run.
Until now, once those issues had been surmounted, the data was free and easy. Common Crawl is a nonprofit that wouldn’t charge Google, Meta, OpenAI or anyone else to use its service; it crawls the internet in search of online text. The information is archived for others to download.
Those archives, along with online data sets, were used to assemble the vast quantities of natural language and specialized information needed to train large AI systems. Tech companies also accessed information made available for research purposes and increasingly strayed from information clearly in the public domain.
Initially, tech companies were loath to pay for that data. At a listening session on generative AI hosted in April by the U.S. Copyright Office, Sy Damle, a lawyer representing the Silicon Valley venture capital firm Andreessen Horowitz, acknowledged that “the only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data.”
Yet, in July, OpenAI made a deal to license content from the Associated Press as training data for its AI models. That idea has also been discussed in the current talks, although they are more concentrated on showing stories as part of ChatGPT responses.
It’s not just reputable information sources that are looking for compensation. Reddit has met with top generative AI companies about being paid for its data. If no deal is reached, Reddit is considering blocking search crawlers from Google and Bing. The irony here, of course, is that all content on Reddit is provided by its users, not by the company, which merely supplies a framework on which others publish.
The downside of blocking search crawlers is that the site would drop out of search results, hiding it from human readers, too. A person familiar with the matter, speaking on condition of anonymity, said that “Reddit can survive without search.”
OpenAI and Google have released tools to block their AI data crawlers but this feels like a set-up, somehow. Online forums including Reddit, Stack Overflow and Wikipedia have their own defensive measures, launching paid portals for AI companies seeking training data.
Rather than providing “data dumps” that made content easily available for AI training, these sites now closely monitor and limit how often they can be mined for data.
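The article doesn't describe how those limits are enforced. One widely used technique for capping request frequency is a token bucket, sketched here under that assumption. The class name, capacity and refill rate are all hypothetical, and the caller passes in the clock time so the behavior is easy to follow.

```python
class TokenBucket:
    """Per-client request limiter (illustrative sketch, not any site's real policy)."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity                  # maximum burst size
        self.refill_per_second = refill_per_second
        self.tokens = capacity                    # start with a full bucket
        self.last_seen = 0.0                      # time of the previous request

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is permitted."""
        elapsed = max(0.0, now - self.last_seen)
        # Top up the bucket for the time that has passed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0                    # spend one token on this request
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_second=1.0)
    for t in (0.0, 0.0, 0.0, 1.0, 1.5, 2.0):
        print(t, bucket.allow(t))
```

With these made-up numbers, a crawler can make a burst of two requests and then roughly one per second. A paid tier could simply be given a larger bucket.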
Before news sites acted, Elon Musk, who has criticized AI, began charging $42,000 for bulk access to posts on Twitter in April. He claimed AI companies had illegally used the data to train their models. Researchers also have to pay.
There’s a growing sense of urgency and uncertainty about who profits from online information. See the sudden uptick in prosecutions and site seizures of shadow libraries, for example. With generative AI poised to transform interaction with the internet, fair payment for data is becoming an existential issue for many companies.
A month after the launch of OpenAI’s GPT-4 in March, traffic to Stack Overflow declined by 15%. According to CEO Prashanth Chandrasekar, the cause of the drop in usership was programmers turning to AI for answers to their coding questions. He also thinks the AI was trained on Stack Overflow’s data.
This week the company laid off 28% of its staff.
As well as the demands for payment, AI firms are facing a slew of copyright lawsuits from authors, artists and software coders seeking damages for infringement – and a cut of AI profits.
Former Arkansas governor Mike Huckabee joined the fray as a plaintiff in a class-action suit against Meta, Microsoft and Bloomberg, alleging that they used pirated books to train AI systems. Meanwhile, trade groups are pushing lawmakers for the right to bargain collectively with tech companies.
Generative AI licensing: the only way forward?
It could be that OpenAI’s engagement in negotiations reflects a desire to strike deals before courts get the chance to weigh in on whether tech companies have a clear legal obligation to license – and pay for – content.
That’s according to James Grimmelmann, a professor of digital and information law at Cornell University, who recently helped organize a workshop on generative AI and the law at the International Conference on Machine Learning.
While Reddit, Stack Overflow and news organizations usher in what he called a new era of “data strikes,” Nicholas Vincent, a professor of computing science at Simon Fraser University in British Columbia, warned that publishers will have to find strength in numbers: AI operators “never, ever care about one person leaving,” he said.
At a news conference in May, NewsCorp chief executive Robert Thomson echoed that idea when asked if he’d like to announce a deal with the big digital players. “I wish,” he responded. “But it can’t just be us.”
Media conglomerate IAC has tried building a coalition of publishers that aims to win billions of dollars from AI companies through a lawsuit or legislative action. The New York Times has also considered a lawsuit against OpenAI.
Data holders are in a position to make a deal, particularly companies used to asserting their intellectual property rights. Individual artists, authors and coders, with fewer resources at their disposal, are at a disadvantage.
Danielle Coffey, president and CEO of the News/Media Alliance (NMA), a trade group representing more than 2,000 publishers, said the White House and other policymakers have been receptive to the need for licensing deals. She recently organized a week of visits in Washington and various state capitals to advocate for copyright protections for publishers.
She says that, with AI, “what goes in, must come out […] If quality content and quality journalism isn’t a part of that, then that is not a good thing for the products themselves — or for society.”
News outlets paywalling crawlers might make it easier to access news sites free. To be honest, if the quality of generative AI’s textual output drops, we’re not sure we’d be that affected.