Reddit in AI content licensing deal with Google
(archive.ph)
Honestly not that big of a deal. OpenAI used it as well for GPT.
GPT-2 used outgoing Reddit links for its training set. "OpenAI developed a new corpus, known as WebText; rather than scraping content indiscriminately from the World Wide Web, WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017."
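That WebText-style filter is easy to picture in code. A minimal sketch, assuming hypothetical post records with `url`, `upvotes`, and `created` fields (this is not OpenAI's actual pipeline, just the filtering rule the quote describes):

```python
from datetime import datetime

# Cutoff from the quoted description: posts submitted before December 2017.
CUTOFF = datetime(2017, 12, 1)

def filter_links(posts):
    """Keep outbound URLs only from posts with >= 3 upvotes before the cutoff.

    posts: iterable of dicts with 'url', 'upvotes', and 'created' keys
    (hypothetical schema for illustration).
    """
    return [
        p["url"]
        for p in posts
        if p["upvotes"] >= 3 and p["created"] < CUTOFF
    ]

posts = [
    {"url": "https://example.com/a", "upvotes": 5, "created": datetime(2017, 6, 1)},
    {"url": "https://example.com/b", "upvotes": 1, "created": datetime(2017, 6, 1)},
    {"url": "https://example.com/c", "upvotes": 9, "created": datetime(2018, 3, 1)},
]
print(filter_links(posts))  # -> ['https://example.com/a']
```

Only the first post passes both conditions; the second fails the upvote threshold and the third falls after the cutoff. The upvote filter is the whole trick: it uses Reddit karma as a cheap human quality signal instead of scraping the web indiscriminately.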
GPT-3 then used Reddit posts themselves. "OpenAI's GPT series was built with data from the Common Crawl dataset, a conglomerate of copyrighted articles, internet posts, web pages, and books scraped from 60 million domains over a period of 12 years. TechCrunch reports this training data includes copyrighted material from the BBC, The New York Times, Reddit, the full text of online books, and more."
They haven't published what was used for GPT-4, afaik.
It is a big deal; we've already seen just how Reddit data poisons the output.