Do you?
For example, do you think there is a practical difference under law between letting a text-to-speech application read you an e-book from Amazon and training your Large Language Model with an e-book from Amazon?
If there is a difference, in your mind, what do you think that it is?
The text-to-speech application is a transient means of communicating the book. It's no different from opening the e-book on a monitor to read the words. Meanwhile, the LLM is ingesting and storing the text of the book. It's an illegal copy permanently stored in the model's dataset.
That said, I'd rather this be resolved by fixing the issues with copyright. This ruling is just another example of the two-tiered system, where AI training is fair use, while you giving a copy to a friend is infringement.
That's not how it works.
Is it not though? Are you saying after book X is used for training that you couldn’t then prompt the AI to “tell me word for word the exact text of book X”?
No, the book isn't copied or stored. The LLM can't regurgitate it on command, because it isn't inside the model.
You can ask the LLM to write new, never before seen text in the style of that author.
Training an LLM is a lot more like reading a book to a toddler than it is like making a digital copy. Neither the toddler nor the LLM can repeat the words of the book.
In terms of copyright, yes it is. It doesn't matter that the book isn't literally copy-pasted into a vector database. The text is used verbatim as training data, and from there isn't made into a sufficiently transformative work to constitute fair use (plus it's commercial). Training data, even if it can neither be recalled on demand nor exists in whole form, has still been stored within the model's semantic memory.
You purchased a text-to-speech application, and then had to legally gain access to said book. The book is property with restricted access: you have to either purchase it or pay for a service that gives you temporary access to it. The purpose of your use of the text-to-speech application is the private consumption of the book's material.
Using an LLM to read a book that you didn't purchase access to, and turning it into the basis for the training of your program, is still going to be theft of the material. If you're going to purchase it for training material, you can probably do that with the explicit recognition that the work and text of the book can't be redistributed through your algorithm. You might be able to get away with specific citations for quotes, but the work itself can't be redistributed, as profit for the book would have been diminished by your algorithm's actions. If the LLM's work that comes from the book is transformative, then you're fine.
What would not be fine is for books to be protected by copyright and paywall, not purchased, and still be data mined for their material without compensation to the owner. This is especially true of something as vast as a for-profit AI business.
Oh, Gizortnik. Here we go.
It isn't theft. Theft, by definition, requires that the criminal deprive someone of something. You can't steal a digital product.
For example, if you steal my car, I can't drive my car. That is theft. Downloading episodes of Magnum PI isn't theft.
It isn't copyright infringement, because that (by definition) covers the uncompensated reproduction and sale of an intellectual property. Training an LLM with an e-book doesn't copy or sell the book.
I think that you need to go and read some definitions, because you are groping in the dark for reasons to object and using words you don't actually understand.
If you want to advocate for new laws that govern how people can create and run computer programs on hardware they own, then advocate for that. You will certainly have a lot of company.
That's never been true. If you want to make a philosophical argument that digital property shouldn't be considered property, fine. But the law is clear that digital assets can be stolen.
Look, guy, it is a matter of definition. Different words mean different things.
If the thief took every copy and the lawful owner could not use the digital product, then that would be theft. If the bad guy just made unauthorized copies or used the digital data (or whatever) without a license, then it isn't theft.
The distinction matters because harm must be assessed when the courts render legal judgement. If I download a copy of Magnum PI, I have not deprived anyone of the use or enjoyment of the episode. If I stole an episode with ninjas, then no network in the world could use it until it was recovered, probably by men with guns.
These are not the same. The damages are not the same.
Gizortnik, this is well-understood law. It isn't even debated in legal circles. It was debated to exhaustion around the time of printed sheet music, more than 200 years ago.
You really are showing your profound ignorance in this specific issue.
My comment was more about the fact that companies 'shouldn't pirate' than how the technical side of LLM training works.