Although I’m a firm believer that most AI models should be public domain or open source by default, the premise of “illegally trained LLMs” is flawed, because there really is no assurance that LLMs currently in use are illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.
The idea of… well, ideas, being copyrightable should shake the boots of anyone in this discussion, especially since when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.
The underlying technology simply has more than enough good uses that banning it would just cause it to flourish in places that do not ban it, which means that, as usual, everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn’t tip further in their favor, to the point where AI technology only exists to benefit them.
If the model is built on the corpus of humanity, then humanity should benefit.
OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but an older paper references two databases: “Books1” and “Books2”. The first contains roughly 63,000 titles and the second around 294,000 titles.
These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.
Should be easy to defend against, outright trivial: OpenAI, just tell us what those Books1 and Books2 databases are, where you got them from, and which licensing contracts you signed with publishers to get access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.
…crickets. They pirated the lot of it, otherwise they would already have gotten that case thrown out. It’s US startup culture, plain and simple: “move fast and break laws”, get lots of money, and have that money pay the best lawyers to abuse the shit out of the US court system.
For OpenAI, I really wouldn’t be surprised if that happened to be the case, considering they still call themselves “OpenAI” despite having the most censored and closed-source AI models on the market.
But my comment was more aimed at AI models in general. If you are assuming they indeed used material that was not publicly posted, and gathered it directly themselves, they would indeed not have a defense to that. Unfortunately, if a third party provided them the data, and did so under false pretenses, it would likely let them off the hook legally even if they had every ethical obligation to make sure it was publicly available. The third party that provided it would be the one infringing.
If that assumption turns out to be true (maybe through some kind of discovery in the trial), they should burn for that. Until then, even if it’s a justified assumption, it’s still an assumption, and most likely not true for most models, certainly not those trained recently.
Banning AI is out of the question. Even the EU accepts that and they tend to be pretty ban heavy, unlike the US.
But it’s important that we have these discussions about how copyright applies to AI so that we can actually get an answer and move on. Right now it’s a legal quagmire that no one really wants to get involved in except the big companies. If a small group of university students wants to build an AI right now, they can’t, because acquiring training data is a legal nightmare, the Twilight Zone of law.
AI is outright unregulated in the EU unless and until you actually use it for something where it becomes relevant. Then you’ve got labelling requirements at the lower end (if your customer service is an AI chat, say that it’s an AI chat), up to heavy, heavy requirements when you use it for stuff like sifting through job applications. The burden of proof that the AI isn’t e.g. racist is on you. Or, for that matter, when using it to reject health insurance claims; I think we saw some news lately out of the US about what can happen when you do that.
OpenAI’s copyright case isn’t really suited to making the legal situation any clearer: We already know that using pirated content to train stuff isn’t legal, because you’re not looking at it legitimately. The case isn’t about the “are computers allowed to learn from public sources just as humans are” question.
the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world
They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.
I agree that we shouldn’t strive for more strict copyright. We should fight for a much more liberal system. But as long as everyone else has to live by the current copyright laws, we should not let AI companies get away with what they’re doing.
I’ve never really delved into the AI copyright debate before, so forgive my ignorance on the matter.
I don’t understand how an AI reading a bunch of books and rearranging some of those words into a new story, is different to a human author reading a bunch of books and rearranging those words into a new story.
Most AI art I’ve seen has been… Unique, to say the least. To me, it tends to be different enough from the art it was trained on to not be a direct ripoff, so personally I don’t see the issue.
I think the main difference is that one is a human author, and this is how humans function. We cannot unsee or unhear things, but we can be compelled not to use that information if the law requires it (company secrets, inadmissible evidence in jury duty, plagiarism laws that already exist). The other is a machine that does not have agency or personhood and has this information (created by other people) fed to it for the sole purpose of creating a closed system for a company so its shareholders can make money. This “open for me but not for thee” approach is the main problem people have. You have this proprietary “open ai” that Microsoft invested 25 or so billion in so they can scrape other people’s work and charge you money for variations of it. I don’t mind abolishing IP or patent laws altogether so everyone can use and improve ChatGPT with whatever they have. But if you yourself are hiding behind IP laws to protect your software while disrespecting other people’s copyright, that’s what people see as problematic.
Yes, this is my exact issue with some framing of AI. Creative people love their influences to the point that you can ask them and they will point to the parts where they referenced or nodded to an influence they partially credit for getting to that result. It’s also extremely normal that when you make something new, you brainstorm and analyze any kind of material (copyrighted or not) you can find that gives the same feelings you want to create. As is ironically said to comfort starting creatives that it’s okay to be inspired by others: “Good artists copy, great artists steal.”
And often people who are very anti-AI don’t see an issue with this, yet it is in essence the same thing the AI does, which is to detach the work from the ideas it was built on, and then re-use those ideas. And just like anyone who has the ability to create has the ability to plagiarize or infringe, so does the AI. As human users of AI we must be the ones to ethically guide it away from that (since it can’t do that itself), just like you would not copy-paste your influences into a new human-made work.
The for-profit large-scale media blender is the problem. When it’s a human writing Harry Potter fan fiction, it’s fine. When a company sells a tool for you to write thousands of trash “books” for profit, it’s a problem.
Which is why the technology itself isn’t the issue, but those willing to use it in unethical ways. AI is an invaluable tool to those with limited means, unlike big corporations.
I don’t understand how an AI reading a bunch of books and rearranging some of those words into a new story, is different to a human author reading a bunch of books and rearranging those words into a new story.
Ok, let’s say for now that these things are actually similar. Is a human legally allowed to “rearrange those words” in any way they want? Not really, because they can’t copy stuff like characters or plot structure. Even if the copy is not verbatim, it has to avoid being “too similar”. It’s not always clear where the threshold is; that will be judged in court. But imagine if you were being sued for copyright infringement because of perceived similarities between your work and another creator’s. You go to court and say “Well I torrented the plaintiff’s work and studied it with the express intent to copy discernible patterns in it, then sell my work based on those patterns”. As long as the similarities are found to be valid, you’re most likely to lose. The fact that you’ve spent years campaigning about how companies can save a lot of money by firing artists and hiring your pattern-replicating service instead probably wouldn’t help your case either. Well, that’s basically what an honest defense of AI against copyright infringement would be. So the question is, does AI actually produce output too similar to its training data? Well, this is an example of articles you can find on the topic…
So based on the above thoughts, do you feel like we hold AI generation to the same standard as we do human creators? It doesn’t seem so to me.
But there’s a lot of reasons why we should hold AIs to higher standards instead. Off the top of my head:
AIs have been created exclusively to replicate patterns in existing works. This is not the only function people have. So we don’t have to wonder whether similarities between AI inputs and outputs are coincidental. We don’t have to worry about whether overbearing restrictions might inadvertently affect some other function.
AIs have no feelings or needs. We don’t have to worry about causing direct harm to them and about protecting their rights. Forbidding a person from reading a book just in case they copy elements from it is obviously problematic, but restricting AI’s access to copyrighted work is not directly harmful in the same way.
ML algorithms aren’t capable of producing anything new, they can only ever produce a mishmash of copies of existing works.
If you feed a generative model a bunch of physics research papers, it won’t create a new valid physics research paper, just a mishmash of jargon from existing papers.
You say it’s not capable of producing anything new, but then give an example of it creating something new. You just changed the goal from “new” to “valid” in the next sentence. Looking at AI for “valid” information is silly, but looking at it for “new” information is not. Humans do this kind of information mixing all the time. It’s why fan works are a thing, and why most creative people have influences they credit with being where they are today.
Nobody alive today isn’t tainted by the ideas they’ve consumed in copyrighted works, but we do not bat an eye if you use that in a transformative manner. And AI already does this transformation much better than humans do since it’s trained on that much more information, diluting the pool of sources, which effectively means less information from a single source is used.
If I write the sentence “Hello, I just got home” and use an algorithm to jumble it into “got Hello, just I home” there’s nothing new there.
There’s no transformation, it’s not capable of transformation, it’s just a very complicated text jumbler that’s supposed to jumble text so that the output is readable by humans.
You’re taking investment advice from a parrot that had the entirety of reddit investment meme subreddits beamed into its brain.
That’s a very short example, but it is a new arrangement of the existing information. It’s not a new valuable arrangement of information, but new nonetheless. And yes, rearrangement is transformation. It’s very low entropy transformation, but transformation nonetheless. Collages and summaries are in fact, a thing that humans make too.
Unless you mean “new” as in, something nobody’s ever written before, in which case not even you can create new information, since pretty much everything you will ever say or write down can be broken down into pieces that have been spoken or written before, which is not exactly a useful distinction.
There’s no transformation, it’s not capable of transformation, it’s just a very complicated text jumbler that’s supposed to jumble text so that the output is readable by humans.
Saying it doesn’t make it true, especially when you follow it up with a self-debunk by saying it transforms the text by jumbling it in specific ways that keep it readable to humans. That requires transformation since, as you just demonstrated, randomly swapping words does not make legible text…
You’re taking investment advice from a parrot that had the entirety of reddit investment meme subreddits beamed into its brain.
They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.
I really kind of hope you’re kidding here. Because this has got to be the most roundabout way of saying they’re analyzing the information. Just because you think it does so to regurgitate (which I have yet to see any good evidence for, at least for the larger models), does not change the definition of analyzing. And by doing so you are misrepresenting it and showing you might just have misunderstood it, which is ironic. And doing so does not help the cause of anyone who wishes to reduce the harm from AI, as you are literally giving ammo to people to point to and say you are being irrational about it.
Yes if you completely ignore how data is processed and how the product is derived from the data, then everything can be labeled “data analysis”. Great point. So copyright infringement can never exist because the original work can always be considered data that you analyze. Incredible.
No, not what I said at all. If you’re trying to say I’m making this argument I’d urge you (ironically) to actually analyze what I said rather than putting words in my mouth ;) (Or just, you know, ask me to clarify)
Copyright infringement (or plagiarism) in its simplest form, as in just taking the material as is, is devoid of any analysis. The point is to avoid having to do that analysis and just get right to the end result that has value.
But that’s not what AI technology does. None of the material used to train it ends up in the model. It looks at the training data and extracts patterns. For text, that is the sentence structure, the likelihood of words being followed by another, the paragraph/line length, the relationship between words when used together, and more. It can do all of this without even ‘knowing’ what these things are, because they are simply patterns that show up in large amounts of data, and machine learning as a technology is made to be able to detect and extract those patterns. That detection is synonymous with how humans do analysis. What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.
The resulting data, when fed back to the AI, can be used to have it extrapolate from incomplete data, which it could not do without such analysis. You can see this quite easily by asking an AI to refer to you by a specific name, or to talk in a specific manner, such as a pirate. It ‘understands’ that certain words are placeholders for names, and that text can be ‘pirate-ified’ by adding filler words or pre/suffixing other words. It could not do so without analysis, unless that exact text was already in the data to begin with, which is doubtful.
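To make “the likelihood of words being followed by another” a bit more concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: real LLMs learn these statistics with neural networks rather than a literal count table, and the function names and example sentence are made up for this post.

```python
from collections import Counter, defaultdict


def bigram_counts(text):
    """Count how often each word is directly followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follower counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: n / total for w, n in followers.items()}


# Hypothetical toy corpus, purely for illustration.
corpus = "the cat sat on the mat and the cat slept"
counts = bigram_counts(corpus)
probs = next_word_probabilities(counts, "the")
print(probs)
```

What gets stored here is a table of which words tend to follow which, not the sentence itself. With a corpus this tiny you could of course reconstruct most of the sentence from the table, so this says nothing either way about how much of the training data a large model can reproduce.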
No, not what I said at all. If you’re trying to say I’m making this argument I’d urge you (ironically) to actually analyze what I said rather than putting words in my mouth ;) (Or just, you know, ask me to clarify)
That was your implied argument regardless of intent.
Copyright infringement (or plagiarism) in its simplest form, as in just taking the material as is, is devoid of any analysis. The point is to avoid having to do that analysis and just get right to the end result that has value.
Completely wrong, which invalidates the point you want to make. “Analysis” and “as is” have no place in the definition of copyright infringement. A derivative work can be very different from the original material, and how you created the derivative work, including whether you performed whatever you think “analysis” means, is generally irrelevant.
What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.
No it detects patterns. You already said it correctly above. And the problem is that some patterns can be copyrighted. That’s exactly the problem highlighted here and here. For copyright law, it doesn’t matter if, for example, that particular image of Mario is copied verbatim from the training data. The character likeness, which is encoded in the model because it is in fact a discernible pattern, is an infringement.
That was your implied argument regardless of intent.
I decide what my argument is, thank you very much. Your interpretation of it is outside of my control, and while I might try to keep it from going astray, I cannot stop it from doing so. That’s on you.
Completely wrong, which invalidates the point you want to make. “Analysis” and “as is” have no place in the definition of copyright infringement. A derivative work can be very different from the original material, and how you created the derivative work, including whether you performed whatever you think “analysis” means, is generally irrelevant.
I wasn’t giving a definition of copyright infringement, since that depends on the jurisdiction, and since you and I most likely aren’t in the same one, that’s nothing I would argue for to begin with. In the most basic form of plagiarism, people do it to avoid the effort of transformation. More complex forms of plagiarism might involve some transformation, but still try to capture the expression of the original instead of the ideas. Analysis is definitely relevant, since to create a work that does not infringe on copyright, you generally can take ideas from a copyrighted work, but not the expression of those ideas. If a new work is based on just those ideas (and preferably mixes them with new ideas), it generally doesn’t infringe on copyright. It’s why there are so many copycat products of everything you can think of that aren’t copyright infringing.
No it detects patterns. You already said it correctly above. And the problem is that some patterns can be copyrighted. That’s exactly the problem highlighted here and here. For copyright law, it doesn’t matter if, for example, that particular image of Mario is copied verbatim from the training data.
While, depending on your definition, Mario could be a sufficiently complex pattern, that’s not the definition I’m using. Mario isn’t a pattern, it’s an expression of multiple patterns. Patterns like “an Italian man”, “a big moustache”, “a red rounded hat with the letter ‘M’ in a white circle”, “overalls”. You can use any of those patterns in a new non-infringing work; Nintendo has no copyright on any of those patterns. But bring them all together in one place again without adding new patterns, and you will have infringed on the expression of Mario. If you give many images of Mario to the AI it might be able to understand that those patterns together are some sort of “Mario-ness” pattern, but it can still separate them from each other since you aren’t just showing it Mario, but also other images that have these same patterns in different expressions.
Mario’s likeness isn’t in the model, but its patterns are. And if an unethical user of the AI wants to prompt it for those specific patterns and then act surprised that they get Mario, or something close enough to be substantially similar, that’s on them, and it will be infringing just like drawing and selling a copy of Mario without Nintendo’s approval is now.
The character likeness, which is encoded in the model because it is in fact a discernible pattern, is an infringement.
You have absolutely no legal basis to claim they are infringement, as these things simply have not been settled in court. You can be of the opinion that they are infringement, but your opinion isn’t the same as law. The articles you showed are also simply reporting and speculating on the lawsuits that are pending.
Plagiarism is not the same as copyright infringement. Why you think people probably plagiarize is doubly irrelevant then.
Analysis is definitely relevant, since to create a work that does not infringe on copyright
Show me literally any example of the defendant’s use of “analysis” having any impact whatsoever in a copyright infringement case or a law that explicitly talks about it, or just stop repeating that it is in any way relevant to copyright.
But bring them all together in one place again without adding new patterns
Wrong. The “all together” and “without adding new patterns” are not legal requirements. You are constantly trying to push the definition of copyright infringement to be more extreme to make it easier for you to argue.
you generally can take ideas from a copyrighted work, but not the expression of those ideas
Unfortunately, an AI has no concept of ideas, and it simply encodes patterns, whatever they might happen to be. Again, you’re morphing the discussion to make an argument.
Mario’s likeness isn’t in the model, but its patterns are.
Mario’s likeness has to be encoded into the model in some way. Otherwise, this would not have been the image generated for “draw an italian plumber from a video game”. There is absolutely nothing in the prompt to push GPT-4 to combine those elements. There are also no “new” patterns, as you put it. That’s exactly the point of the article. As they put it:
Clearly, these models did not just learn abstract facts about plumbers—for example, that they wear overalls and carry wrenches. They learned facts about a specific fictional Italian plumber who wears white gloves, blue overalls with yellow buttons, and a red hat with an “M” on the front.
These are not facts about the world that lie beyond the reach of copyright. Rather, the creative choices that define Mario are likely covered by copyrights held by Nintendo.
This is contradictory to how you present it as “taking ideas”.
You have absolutely no legal basis to claim they are infringement
You’re mixing up different things. I’m saying that the image contains infringing material, which is hopefully not something you have to be convinced about. The production of an obviously infringing image, without the infringing elements having been provided in the prompt, is used to show how this information is encoded inside the model in some form. Whether this copyright-protected material exists in some form inside the model is not an equivalent question to whether this is copyright infringement. You are right that the courts have not decided on the latter, but we have been talking about the former. I repeat your position which I was directly responding to before:
What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.
Plagiarism is not the same as copyright infringement. Why you think people probably plagiarize is doubly irrelevant then.
I never claimed it was. As I said before, the legal definition is beside the point because copyright infringement differs depending on local laws, whereas plagiarism is usually the concept that guides the ethical position from which those laws are produced, which is why, yes, it’s relevant.
Show me literally any example of the defendant’s use of “analysis” having any impact whatsoever in a copyright infringement case or a law that explicitly talks about it, or just stop repeating that it is in any way relevant to copyright.
This is an unreasonable request, and you know it to be. Again, we don’t share the same laws and different jurisdictions provide different exceptions like fair use, fair dealing, or just straight up exclusion from copyright for their use. But it is wholly beside my argument. You can look at any piece of modern media that exists in the same space and see ideas the two share, while not sharing the same expression of that idea. How some characters fulfill the same purpose, dress the same way, or have similar personalities. You are free to make a book with a plumber, a mustached man, someone wearing a red hat with the letter M on it, and someone that goes to save a princess from a castle, but if they’re not the same person they are most likely not considered to be the protected expression of Mario. Same ideas that make up Mario, one infringing, the other not.
Nobody goes to court over this because EVERYONE takes each other’s ideas, “Good artists copy, great artists steal”. It’s only when you step on the specific expression of an idea that it becomes realistically actionable, and at that point transformativeness is discussed almost every single time, because it is critical to determining whether the copyright was actually infringed.
Wrong. The “all together” and “without adding new patterns” are not legal requirements. You are constantly trying to push the definition of copyright infringement to be more extreme to make it easier for you to argue.
I’m sorry, but are you really being this dishonest? I’ve mentioned EXPLICITLY in my last comment that I wasn’t giving a definition of copyright infringement, because it’s beside the point, and not what I’m claiming. Yet here you are saying I am “trying to push” a definition. We are not lawyers or law scholars speaking to each other; I am having a discussion with you as another anonymous person on a message board.
Unfortunately, an AI has no concept of ideas, and it simply encodes patterns, whatever they might happen to be.
You are just arguing semantics and linguistics, it’s meaningless. We are not talking technical specifics, not even a specific model, nor a specific technique to specify exactly how the information is encoded. It’s a rough concept of “ideas” / “data” / “patterns”: information. And AI definitely has that.
Again, you’re morphing the discussion to make an argument.
You mean, I’m making an argument. Because yes. I am. I don’t see why this negative framing is necessary nor why this is noteworthy enough to bring up, unless you really just want to make me look bad for no apparent reason.
Mario’s likeness has to be encoded into the model in some way. Otherwise, this would not have been the image generated for “draw an italian plumber from a video game”. There is absolutely nothing in the prompt to push GPT-4 to combine those elements. There are also no “new” patterns, as you put it. That’s exactly the point of the article. As they put it:
Yes, there is some idea/pattern of “Mario-ness” in the model, I said that. I was not trying to say that no material of Mario was used in training, but that it’s not as if someone pasted direct images of Mario in there. AI models make logical connections between concepts, even for things we cannot put a good name to, and will allow you to prompt for them, but that does not mean you should.
Clearly, these models did not just learn abstract facts about plumbers—for example, that they wear overalls and carry wrenches. They learned facts about a specific fictional Italian plumber who wears white gloves, blue overalls with yellow buttons, and a red hat with an “M” on the front.
These are not facts about the world that lie beyond the reach of copyright. Rather, the creative choices that define Mario are likely covered by copyrights held by Nintendo.
I sort of already explained this without mentioning this specific example, but I’ll make it extra clear.
In the article they prompted the AI for a “video game Italian plumber”.
What person, if you asked them to think of an “Italian video game plumber”, would not think of Mario? Maybe Luigi?
I’ll tell you: hardly anyone, because there are very damn few famous Italian video game plumbers. The prompt is already locked in on Mario, and even humans make the logical connection to Mario. It might have had billions of images and texts to use, but any time a relation to an “Italian video game plumber” showed up, there’s Mario.
So this whole point the article makes about it not learning abstract facts about plumbers is completely moot, because they biased the outputs towards receiving what they wanted to receive. If you ask for just a plumber, for which it does have many, many results, it will make more generalizations and become less specific, because there are more than 2 examples of plumbers in other types of situations. Humans do this exact same thing in the same task, yet somehow the AI must be immune to this despite being an artificial version of the biological thing. And that is why analysis is protected: humans simply cannot stop doing it, and everyone is tainted by their knowledge of Mario, even though for whatever reason we might need to use one of the ideas Mario is built upon. And this is why AIs use this same defense. I can say this regardless of the jurisdiction because unless you live in some kind of dictatorship this is generally true.
Sadly, this kind of deceptive framing of AI output is common, particularly among those that are biased against AI. Sometimes it’s unintentional, but frequently specific parameters are used that will just generate specific bad results, ignoring that this may not even represent 0.001% of what the model can generate in normal situations.
This is contradictory to how you present it as “taking ideas”.
It is not. You can use the idea of Mario, you cannot use the totality of Mario. For the AI to be able to use the idea of Mario, it will also ‘learn’ the totality of Mario in the process, as Mario is a collection of ideas that are extracted. But those ideas are stored separately so they can be individually prompted for. You can prompt it to make Mario because, like almost every person in society, it knows what ideas make up Mario better than I can put into words here. If I hire a human artist to make me a “video game Italian plumber”, their first question to me would be “Oh, something like Mario?” and their second response will be “Oh I can’t do that, and you should not want to, because you don’t own Mario.” Humans use AI, so they need to be the ones to give that second response.
Just because a kitchen knife can be used to stab someone doesn’t mean we produce kitchen knives for stabbing people. Just because an AI can be used to infringe does not mean it is produced to infringe. That is evidenced by the vast majority of other ways it can be used that don’t infringe, which is self-evident after just tinkering around with it for a little while.
You’re mixing up different things. I’m saying that the image contains infringing material, which is hopefully not something you have to be convinced about. The production of an obviously infringing image, without the infringing elements having been provided in the prompt, is used to show how this information is encoded inside the model in some form. Whether this copyright-protected material exists in some form inside the model is not an equivalent question to whether this is copyright infringement. You are right that the courts have not decided on the latter, but we have been talking about the former. I repeat your position which I was directly responding to before:
If it’s anything like the examples before, then the AI has definitely been prompted by the user to make infringing elements.
But anyways, to the question, you just don’t seem to grasp that collections of ideas can communicate copyright infringing material without being infringing on their own. It’s like arguing that if Paint or Photoshop knows about the color red that this is copyright infringing because it’s the same red that Mario uses. None of the ideas that make up Mario are infringing, and cannot be copyrighted. They are what the AI is designed to extract, not Mario as a totality.
You can definitely use AI to make an infringement machine by making it less likely to make leaps in ideas and just only combine the ideas it’s been taught on, which we as humans can do as well in the form of plagiarism and forgery. But if you’re going to be unethical why use an AI when you might as well just take the easy route directly with print screen or a photo. Two other technologies we didn’t ban for having this ability to capture copyrighted material, even if they far more blatantly copy the material.
This is where good AI usage deviates, because it instead tries to MAXIMIZE the amount of leaps and connections the AI makes for as little possibility to make something infringing. Even honest people trying to make new creative works sometimes have to change things because they might be too close to being infringing.
There are law offices that exist specifically to fuck with people over patent and copyright law.
There’s also cases where people use copyright and patent law to hold us back. I can’t find the article but some religious jerk patented connecting a sex toy to a computer via USB. Thankfully someone got around this law with bluetooth and cell phones. Otherwise I imagine the camgirl and LDR market for toys would’ve been hit with products 10 years sooner.
Although I’m a firm believer that most AI models should be public domain or open source by default, the premise of “illegally trained LLMs” is flawed, because there really is no assurance that the LLMs currently in use were illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.
The idea of… well, ideas, being copyrightable, should have anyone in this discussion shaking in their boots. Especially since when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.
The underlying technology simply has more than enough good uses that banning it would just cause it to flourish elsewhere, in places that do not ban it, which means as usual that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn’t tip further in their favor, to the point that AI technology only exists to benefit them.
If the model is built on the corpus of humanity, then humanity should benefit.
As per torrentfreak
Should be easy to defend against, outright trivial: OpenAI, just tell us what those Books1 and Books2 databases are, where you got them from, and which licensing contracts you signed with publishers to get access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.
…crickets. They pirated the lot of it, otherwise they would already have gotten that case thrown out. It’s US startup culture, plain and simple: “move fast and break laws”, get lots of money, and use that money to pay the best lawyers to abuse the shit out of the US court system.
For OpenAI, I really wouldn’t be surprised if that happened to be the case, considering they still call themselves “OpenAI” despite having the most censored and closed-source AI models on the market.
But my comment was more aimed at AI models in general. If you are assuming they indeed used material that was not publicly posted or gathered, and did so directly themselves, they would indeed not have a defense to that. Unfortunately, if a third party provided them the data, and did so under false pretenses, it would likely let them legally off the hook even if they had every ethical obligation to make sure it was publicly available. The third party that provided it to them would be the one infringing.
If that assumption turns out to be true (maybe through some kind of discovery in the trial), they should burn for that. Until then, even if it’s a justified assumption, it’s still an assumption, and most likely not true for most models, certainly not those trained recently.
Banning AI is out of the question. Even the EU accepts that and they tend to be pretty ban heavy, unlike the US.
But it’s important that we have these discussions about how copyright applies to AI so that we can actually get an answer and move on. Right now it’s a legal quagmire that no one really wants to get involved in except the big companies. If a small group of university students wants to build an AI right now, they can’t, because acquiring training data is a legal nightmare, the Twilight Zone of law.
AI is outright unregulated in the EU unless and until you actually use it for something where it becomes relevant. Then you’ve got, at the lower end, labelling requirements (if your customer service is an AI chat, say that it’s an AI chat), up to heavy, heavy requirements when you use it for stuff like sifting through job applications. The burden of proof that the AI isn’t e.g. racist is on you. Or, for that matter, using it to reject health insurance claims; I think we saw some news lately out of the US about what can happen when you do that.
OpenAI’s copyright case isn’t really suited to making the legal situation any clearer: We already know that using pirated content to train stuff isn’t legal because you’re not looking at it legitimately. The case isn’t about the “are computers allowed to learn from public sources just as humans are” question.
They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.
I agree that we shouldn’t strive for more strict copyright. We should fight for a much more liberal system. But as long as everyone else has to live by the current copyright laws, we should not let AI companies get away with what they’re doing.
I’ve never really delved into the AI copyright debate before, so forgive my ignorance on the matter.
I don’t understand how an AI reading a bunch of books and rearranging some of those words into a new story, is different to a human author reading a bunch of books and rearranging those words into a new story.
Most AI art I’ve seen has been… Unique, to say the least. To me, they tend to be different enough to the art they were trained in to not be a direct ripoff, so personally I don’t see the issue.
I think the main difference is that one is a human author, and this is simply how humans function. We cannot unsee or unhear things, but we can be compelled not to use that information if the law requires it: company secrets, inadmissible evidence in jury duty, or the plagiarism laws that already exist. The other is a machine that does not have agency or personhood and that has this information (created by other people) fed to it for the sole purpose of creating a closed system for a company, so its shareholders can make money. This “open for me but not for thee” approach is the main problem people have. You have this proprietary “OpenAI” that Microsoft invested 25 or so billion in so they can scrape other people’s work and charge you money for variations of it. I don’t mind abolishing IP or patent laws altogether so everyone can use and improve ChatGPT with whatever they have. But if you yourself are hiding behind IP laws to protect your software while disrespecting other people’s copyright, that’s what people see as problematic.
Yes, this is my exact issue with some framing of AI. Creative people love their influences to the point that you can ask them and they will point to the parts that reference or nod to an influence they partially credit for getting to that result. It’s also extremely normal that when you make something new, you brainstorm and analyze any kind of material (copyrighted or not) you can find that gives the same feelings you desire to create. As is ironically said to comfort starting creatives that it’s okay to be inspired by others: “Good artists copy, great artists steal.”
And often people who are very anti-AI don’t see an issue with this, yet it is in essence the same thing the AI does, which is to detach the work from the ideas it was built on, and then re-use those ideas. And just like anyone who has the ability to create has the ability to plagiarize or infringe, so does the AI. As human users of AI we must be the ones to ethically guide it away from that (since it can’t do that itself), just like you would not copy-paste your influences into a new human-made work.
The for-profit large-scale media blender is the problem. When it’s a human writing Harry Potter fan fiction, it’s fine. When a company sells a tool for you to write thousands of trash “books” for profit, it’s a problem.
Which is why the technology itself isn’t the issue, but those willing to use it in unethical ways. AI is an invaluable tool to those with limited means, unlike big corporations.
Ok, let’s say for now that these things are actually similar. Is a human legally allowed to “rearrange those words” in any way they want? Not really, because they can’t copy stuff like characters or plot structure. Even if the copy is not verbatim, it has to avoid being “too similar”. It’s not always clear where the threshold is; that will be judged in court. But imagine if you were being sued for copyright infringement because of perceived similarities between your work and another creator’s. You go to court and say “Well I torrented the plaintiff’s work and studied it with the express intent to copy discernible patterns in it, then sell my work based on those patterns”. As long as the similarities are found to be valid, you’re most likely to lose. The fact that you’ve spent years campaigning about how companies can save a lot of money by firing artists and hiring your pattern-replicating service instead probably wouldn’t help your case either. Well, that’s basically what an honest defense of AI against copyright infringement would be. So the question is, does AI actually produce output too similar to its training data? Well, this is an example of articles you can find on the topic…
So based on the above thoughts, do you feel like we hold AI generation to the same standard as we do human creators? It doesn’t seem so to me.
But there’s a lot of reasons why we should hold AIs to higher standards instead. Off the top of my head:
ML algorithms aren’t capable of producing anything new, they can only ever produce a mishmash of copies of existing works.
If you feed a generative model a bunch of physics research papers, it won’t create a new valid physics research paper, just a mishmash of jargon from existing papers.
You say it’s not capable of producing anything new, but then give an example of it creating something new. You just changed the goal from “new” to “valid” in the next sentence. Looking at AI for “valid” information is silly, but looking at it for “new” information is not. Humans do this kind of information mixing all the time. It’s why fan works are a thing, and why most creative people have influences they credit with being where they are today.
Nobody alive today isn’t tainted by the ideas they’ve consumed in copyrighted works, but we do not bat an eye if you use that in a transformative manner. And AI already does this transformation much better than humans do since it’s trained on that much more information, diluting the pool of sources, which effectively means less information from a single source is used.
It doesn’t give you new information.
If I write the sentence “Hello, I just got home” and use an algorithm to jumble it into “got Hello, just I home” there’s nothing new there.
There’s no transformation, it’s not capable of transformation, it’s just a very complicated text jumbler that’s supposed to jumble text so that the output is readable by humans.
You’re taking investment advice from a parrot that had the entirety of reddit investment meme subreddits beamed into its brain.
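For anyone who wants to poke at the “text jumbler” example themselves, here is a minimal sketch of it in Python. It only illustrates the toy case from this comment (shuffling the words of one sentence); the function name `jumble` and the `seed` parameter are made up for the illustration, and it is not a claim about how any actual language model works.

```python
import random


def jumble(sentence, seed=None):
    """Return the words of `sentence` in a random order (hypothetical helper)."""
    rng = random.Random(seed)  # a seeded generator makes the shuffle repeatable
    words = sentence.split()
    rng.shuffle(words)         # in-place random permutation of the word list
    return " ".join(words)


original = "Hello, I just got home"
shuffled = jumble(original, seed=0)
print(shuffled)

# The output uses exactly the same words as the input, only reordered.
print(sorted(shuffled.split()) == sorted(original.split()))
```

The second print is always `True`: a shuffle can only reorder the words it was given. Whether that counts as “new” or as “transformation” is the disagreement in this thread, and the code doesn’t settle it.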
That’s a very short example, but it is a new arrangement of the existing information. It’s not a new valuable arrangement of information, but new nonetheless. And yes, rearrangement is transformation. It’s very low entropy transformation, but transformation nonetheless. Collages and summaries are in fact, a thing that humans make too.
Unless you mean “new” as in, something nobody’s ever written before, in which case not even you can create new information, since pretty much everything you will ever say or write down can be broken down into pieces that have been spoken or written before, which is not exactly a useful distinction.
Saying it doesn’t make it true, especially when you follow it up with a self-debunk by saying it transforms the text by jumbling it in specific ways that keep it readable to humans. That requires transformation since, as you just demonstrated, randomly swapping words does not make legible text…
???
https://youtu.be/2TRmaAxHDDU
I really kind of hope you’re kidding here. Because this has got to be the most roundabout way of saying they’re analyzing the information. Just because you think it does so to regurgitate (which I have yet to see any good evidence for, at least for the larger models), does not change the definition of analyzing. And by doing so you are misrepresenting it and showing you might just have misunderstood it, which is ironic. And doing so does not help the cause of anyone who wishes to reduce the harm from AI, as you are literally giving ammo to people to point to and say you are being irrational about it.
Yes if you completely ignore how data is processed and how the product is derived from the data, then everything can be labeled “data analysis”. Great point. So copyright infringement can never exist because the original work can always be considered data that you analyze. Incredible.
No, not what I said at all. If you’re trying to say I’m making this argument I’d urge you (ironically) to actually analyze what I said rather than putting words in my mouth ;) (Or just, you know, ask me to clarify)
Copyright infringement (or plagiarism) in its simplest form, as in just taking the material as is, is devoid of any analysis. The point is to avoid having to do that analysis and just get right to the end result that has value.
But that’s not what AI technology does. None of the material used to train it ends up in the model. It looks at the training data and extracts patterns. For text, that is the sentence structure, the likelihood of words being followed by another, the paragraph/line length, the relationship between words when used together, and more. It can do all of this without even ‘knowing’ what these things are, because they are simply patterns that show up in large amounts of data, and machine learning as a technology is made to be able to detect and extract those patterns. That detection is synonymous with how humans do analysis. What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.
The resulting data, when fed back to the AI, can be used to have it extrapolate from incomplete data, which it could not do without such analysis. You can see this quite easily by asking an AI to refer to you by a specific name, or to talk in a specific manner, such as a pirate. It ‘understands’ that certain words are placeholders for names, and that text can be ‘pirate-ified’ by adding filler words or pre/suffixing other words. It could not do so without analysis, unless that exact text was already in the data to begin with, which is doubtful.
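To make one of the patterns named above concrete (“the likelihood of words being followed by another”), here is a toy sketch in Python. It is an assumption-laden illustration only: a simple count table over a made-up corpus, with a made-up function name `bigram_probabilities`. Real LLMs are neural networks, not count tables, and this sketch says nothing either way about whether a large model can also memorize parts of its training text.

```python
from collections import Counter, defaultdict


def bigram_probabilities(text):
    """Estimate P(next word | current word) from raw word-pair counts."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    # Count how often each word is directly followed by each other word.
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    # Normalize the counts for each word into probabilities.
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }


# Made-up toy corpus, purely for illustration.
corpus = "the cat sat on the mat and the cat slept"
probs = bigram_probabilities(corpus)
print(probs["the"])  # "cat" follows "the" two times out of three, "mat" once
print(probs["cat"])  # "sat" and "slept" each follow "cat" half the time
```

What the table holds is a set of statistics about which word tends to follow which. With a corpus this tiny you could nearly reconstruct the sentence from it, which is a fair reminder that how much of the source survives in the extracted patterns depends heavily on scale.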
That was your implied argument regardless of intent.
Completely wrong, which invalidates the point you want to make. “Analysis” and “as is” have no place in the definition of copyright infringement. A derivative work can be very different from the original material, and how you created the derivative work, including whether you performed whatever you think “analysis” means, is generally irrelevant.
No it detects patterns. You already said it correctly above. And the problem is that some patterns can be copyrighted. That’s exactly the problem highlighted here and here. For copyright law, it doesn’t matter if, for example, that particular image of Mario is copied verbatim from the training data. The character likeness, which is encoded in the model because it is in fact a discernible pattern, is an infringement.
I decide what my argument is, thank you very much. Your interpretation of it is outside of my control, and while I might try to keep it from going astray, I cannot stop it from doing so. That’s on you.
I wasn’t giving a definition of copyright infringement, since that depends on the jurisdiction, and since you and I aren’t in the same one most likely, that’s nothing I would argue for to begin with. In the most basic form of plagiarism, people do so to avoid doing the effort of transformation. More complex forms of plagiarism might involve some transformation, but still try to capture the expression of the original, instead of the ideas. Analysis is definitely relevant, since to create a work that does not infringe on copyright, you generally can take ideas from a copyrighted work, but not the expression of those ideas. If a new work is based on just those ideas (and preferably mixes it with new ideas), it generally doesn’t infringe on copyright. It’s why there are so many copycat products of everything you can think of, that aren’t copyright infringing.
While depending on your definition Mario could be a sufficiently complex pattern, that’s not the definition I’m using. Mario isn’t a pattern, it’s an expression of multiple patterns. Patterns like “an italian man”, “a big moustache”, “a red rounded hat with the letter ‘M’ in a white circle”, “overalls”. You can use any of those patterns in a new non-infringing work, Nintendo has no copyright on any of those patterns. But bring them all together in one place again without adding new patterns, and you will have infringed on the expression of Mario. If you give many images of Mario to the AI it might be able to understand that those patterns together are some sort of “Mario-ness” pattern, but it can still separate them from each other since you aren’t just showing it Mario, but also other images that have these same patterns in different expressions.
Mario’s likeness isn’t in the model, but its patterns are. And if an unethical user of the AI prompts it for those specific patterns and then acts surprised when they get Mario, or something close enough to be substantially similar, that’s on them, and it will be infringing just like drawing and selling a copy of Mario without Nintendo’s approval is now.
You have absolutely no legal basis to claim they are infringement, as these things simply have not been settled in court. You can be of the opinion that they are infringement, but your opinion isn’t the same as law. The articles you showed are also simply reporting and speculating on the lawsuits that are pending.
Plagiarism is not the same as copyright infringement. Why you think people probably plagiarize is doubly irrelevant then.
Show me literally any example of the defendant’s use of “analysis” having any impact whatsoever in a copyright infringement case or a law that explicitly talks about it, or just stop repeating that it is in any way relevant to copyright.
Wrong. The “all together” and “without adding new patterns” are not legal requirements. You are constantly trying to push the definition of copyright infringement to be more extreme to make it easier for you to argue.
Unfortunately, an AI has no concept of ideas, and it simply encodes patterns, whatever they might happen to be. Again, you’re morphing the discussion to make an argument.
Mario’s likeness has to be encoded into the model in some way. Otherwise, this would not have been the image generated for “draw an italian plumber from a video game”. There is absolutely nothing in the prompt to push GPT-4 to combine those elements. There are also no “new” patterns, as you put it. That’s exactly the point of the article. As they put it:
This is contradictory to how you present it as “taking ideas”.
You’re mixing up different things. I’m saying that the image contains infringing material, which is hopefully not something you have to be convinced about. The production of an obviously infringing image, without the infringing elements having been provided in the prompt, is used to show how this information is encoded inside the model in some form. Whether this copyright-protected material exists in some form inside the model is not an equivalent question to whether this is copyright infringement. You are right that the courts have not decided on the latter, but we have been talking about the former. I repeat your position which I was directly responding to before:
I never claimed it was, but as I said before, it is irrelevant because copyright infringement differs in places depending on the local laws, but plagiarism is usually the concept that guides the ethical position from which those laws are produced, which is why yes, it’s relevant.
This is an unreasonable request, and you know it to be. Again, we don’t share the same laws, and different jurisdictions provide different exceptions like fair use, fair dealing, or just straight up exclusion from copyright for their use. But it is wholly beside my argument. You can look at any piece of modern media that exists in the same space and see ideas the two share, while not sharing the same expression of that idea. How some characters fulfill the same purpose, dress the same way, or have similar personalities. You are free to make a book with a plumber, a mustached man, someone wearing a red hat with the letter M on it, and someone that goes to save a princess from a castle, but if they’re not the same person they are most likely not considered to be the protected expression of Mario. Same ideas that make up Mario, one infringing, the other not.
Nobody goes to court over this because EVERYONE takes each other’s ideas: “Good artists copy, great artists steal”. It’s only when you step on the specific expression of an idea that it becomes realistically actionable, and at that point transformativeness is discussed almost every single time, because it is critical to determining whether the copyright was actually infringed.
I’m sorry but, are you really being this dishonest? I’ve mentioned EXPLICITLY in my last comment that I wasn’t giving a definition of copyright infringement, because it’s beside the point, and not what I’m claiming. Yet here you are saying I am “trying to push” a definition. We are not lawyers or law scholars speaking to each other; I am having a discussion with you as another anonymous person on a message board.
You are just arguing semantics and linguistics, it’s meaningless. We are not talking technical specifics, not even a specific model, nor a specific technique that specifies exactly how the information is encoded. It’s a rough concept of “ideas” / “data” / “patterns”: information. And AI definitely has that.
You mean, I’m making an argument. Because yes. I am. I don’t see why this negative framing is necessary nor why this is noteworthy enough to bring up, unless you really just want to make me look bad for no apparent reason.
Yes, there is some idea/pattern of “Mario-ness” in the model, I said that. I was not trying to say that no material of Mario was used in training, but that it’s not like someone pasted direct images of Mario in there. AI models make logical connections between concepts, even for things we cannot put a good name to, and will allow you to prompt for them, but that does not mean you should.
I sort of already explained this without mentioning this specific example, but I’ll make it extra clear.
In the article they prompted the AI for a “video game Italian plumber”. What person, if you asked them to think of an “Italian video game plumber”, would not think of Mario? Maybe Luigi? Hardly anyone, because there are damn few famous Italian video game plumbers. The prompt is already locked in on Mario, and even humans make the logical connection to Mario. It might have had billions of images and texts to use, but any time a relation to an “Italian video game plumber” showed up, there’s Mario.
So this whole point the article makes about it not learning abstract facts about plumbers is completely moot, because they biased the outputs towards receiving what they wanted to receive. If you ask for just a plumber, for which it does have many, many results, it will make more generalizations and become less specific, because there are more than 2 examples of plumbers in other types of situations. Humans do this exact same thing in the same task, yet somehow the AI is expected to be immune to this despite being an artificial version of the biological thing. And that is why analysis is protected: humans simply cannot stop doing it, and everyone is tainted by their knowledge of Mario, even though for whatever reason we might need to use one of the ideas Mario is built upon. And this is why AIs use this same defense. I can say this regardless of the jurisdiction because unless you live in some kind of dictatorship this is generally true.
Sadly, this kind of deceptive framing of AI output is common, particularly among those that are biased against AI. Sometimes it’s unintentional, but frequently specific parameters are used that will just generate specific bad results, ignoring that this may not even represent 0.001% of what the model can generate in normal situations.
It is not. You can use the idea of Mario, you cannot use the totality of Mario. For the AI to be able to use the idea of Mario, it will also ‘learn’ the totality of Mario in the process, as Mario is a collection of ideas that are extracted. But those ideas are stored separately so they can be individually prompted for. You can prompt it to make Mario, because like literally almost every person in society, they know what ideas make up Mario better than I can put to words here. If I hire a human artist to make me a “video game Italian plumber”, their first question to me would be “Oh, something like Mario?” and their second response will be “Oh I can’t do that, and you should not want to, because you don’t own Mario.”. Humans use AI, so they need to be the ones to give that second response.
Just because a kitchen knife can be used to stab someone doesn’t mean we produce kitchen knives for stabbing people. Just because an AI can be used to infringe does not mean that it is produced to infringe. This is evidenced by the vast majority of other ways it can be used that don’t infringe, which is self-evident after just tinkering around with it for a little while.
If it’s anything like the examples before, then the AI has definitely been prompted by the user to make infringing elements.
But anyways, to the question, you just don’t seem to grasp that collections of ideas can communicate copyright infringing material without being infringing on their own. It’s like arguing that if Paint or Photoshop knows about the color red that this is copyright infringing because it’s the same red that Mario uses. None of the ideas that make up Mario are infringing, and cannot be copyrighted. They are what the AI is designed to extract, not Mario as a totality.
You can definitely use AI to make an infringement machine by making it less likely to make leaps in ideas and having it only combine the ideas it’s been taught, which we as humans can do as well in the form of plagiarism and forgery. But if you’re going to be unethical, why use an AI when you might as well take the easy route directly with print screen or a photo? Those are two other technologies we didn’t ban for having this ability to capture copyrighted material, even though they copy the material far more blatantly.
This is where good AI usage deviates, because it instead tries to MAXIMIZE the number of leaps and connections the AI makes, so there is as little chance as possible of producing something infringing. Even honest people trying to make new creative works sometimes have to change things because they might be too close to being infringing.
Not to mention patent laws are bullshit.
There are law offices that exist specifically to fuck with people over patent and copyright law.
There are also cases where people use copyright and patent law to hold us back. I can’t find the article but some religious jerk patented connecting a sex toy to a computer via USB. Thankfully someone got around this with Bluetooth and cell phones. Without that patent, I imagine the camgirl and LDR market for toys would’ve been hit with products 10 years sooner.
copyright*
Fixed. copyright*