Reddit in AI content licensing deal with Google
(archive.ph)
Honestly not that big of a deal. OpenAI used it as well for GPT.
GPT-2 used outgoing Reddit links for its training set. "OpenAI developed a new corpus, known as WebText; rather than scraping content indiscriminately from the World Wide Web, WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017."
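That WebText-style filter is easy to picture in code. A minimal sketch, assuming hypothetical post records with `url`, `upvotes`, and `created` fields (this is not OpenAI's actual pipeline, just the filtering rule the quote describes):

```python
from datetime import datetime

# Cutoff from the quoted description: posts submitted before December 2017.
CUTOFF = datetime(2017, 12, 1)

def filter_links(posts):
    """Keep outbound URLs only from posts with >= 3 upvotes before the cutoff.

    posts: iterable of dicts with 'url', 'upvotes', and 'created' keys
    (hypothetical schema for illustration).
    """
    return [
        p["url"]
        for p in posts
        if p["upvotes"] >= 3 and p["created"] < CUTOFF
    ]

posts = [
    {"url": "https://example.com/a", "upvotes": 5, "created": datetime(2017, 6, 1)},
    {"url": "https://example.com/b", "upvotes": 1, "created": datetime(2017, 6, 1)},
    {"url": "https://example.com/c", "upvotes": 9, "created": datetime(2018, 3, 1)},
]
print(filter_links(posts))  # -> ['https://example.com/a']
```

Only the first post passes both conditions; the second fails the upvote threshold and the third falls after the cutoff. The upvote filter is the whole trick: it uses Reddit karma as a cheap human quality signal instead of scraping the web indiscriminately.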
GPT-3 then used Reddit posts themselves. "OpenAI's GPT series was built with data from the Common Crawl dataset, a conglomerate of copyrighted articles, internet posts, web pages, and books scraped from 60 million domains over a period of 12 years. TechCrunch reports this training data includes copyrighted material from the BBC, The New York Times, Reddit, the full text of online books, and more."
They haven't published what was used for GPT-4, afaik.
It is a big deal; we've already seen just how Reddit data poisons the output.