OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling's Harry Potter series: A new research paper laid out ways in which AI developers should try to avoid showing that LLMs have been trained on copyrighted material.
It's a bit pedantic, but I'm not really sure I support this kind of extremist view of copyright and the scale of what's being interpreted as 'possessed' under the idea of copyright. Once an idea is communicated, it becomes a part of the collective consciousness. Different people interpret and build upon that idea in various ways, making it a dynamic entity that evolves beyond the original creator's intention. It's like the issues with sampling beats or records in the early days of hip-hop. The very principle of an idea goes against this vision; more than that, once you put something out into the commons, it's irretrievable. It's not really yours any more once it's been communicated. I think if you want to keep an idea truly yours, then you should keep it to yourself. Otherwise you are participating in a shared vision of the idea. You don't control how the idea is interpreted, so it's not really yours any more.
Whether that's ChatGPT or Public Enemy is neither here nor there to me. The idea that a work like Peter Pan is still possessed is a very real but very silly malady of this weirdly accepted, extreme view of the ability to possess an idea.
AI isn't interpreting anything. This isn't the sci-fi style of AI that people think of; that's general AI. This is narrow AI, which is really just an advanced algorithm. It can't create new things with intent and design; it can only regurgitate a mix of pre-existing material based on narrow guidelines programmed into it to try to keep it coherent, with no actual thought or interpretation involved in the result. The issue isn't that it's derivative; the issue is that it can only ever be inherently derivative, without any intentional interpretation or creativity, and nothing else.
Even collage art has to qualify as fair use to avoid copyright infringement if it's being done for profit, and fair use requires it to provide commentary, criticism, or parody of the original work used (which requires intent). Even if it's transformative enough to make the original unrecognizable, if the majority of the work is not your own art, then you need permission to use it; otherwise you aren't automatically safe from getting in trouble over copyright. Even using images in Photoshop involves Creative Commons and commercial-use licenses.

Fan art and fanfic are also considered a grey area, and the only reason more of a stink isn't kicked up over them regarding copyright is that they're generally beneficial to the original creators, and credit is naturally provided by the nature of fan works, so long as someone doesn't try to claim the characters or IP as their own. So most creators turn a blind eye to the copyright aspect of the genre, but if any ever did want to kick up a stink, they could, and some have in the past, like Anne Rice. As a result, most fanfiction sites do not allow writers to profit off of fanfics or advertise fanfic commissions. And those are cases where actual humans produce the works based on something that inspired them or that they are interpreting. So even human-made derivative works have rules and laws applied to them.

AI isn't a creative force with thoughts and ideas and intent; it's just a pattern recognition and replication tool, and it doesn't benefit creators when it's used to replace them entirely, like Hollywood is attempting to do (among other corporate entities). Viewing AI at least as critically as actual human beings is the very least we can do, as well as establishing protections for human creators so that they can't be taken advantage of because of AI.
I'm not inherently against AI as a concept or as a tool for creators to use, but I am against AI works with no human input being used to replace creators entirely, and I am against using works to train it without the permission of the original creators. Even in artist/writer/etc. communities it's considered common courtesy to credit the people/works that you based a work on or took inspiration from, even if what you made would be safe under copyright law regardless. Sure, humans get some leeway here because we are imperfect meat creatures with imperfect memories and may not be aware of all our influences, but a coded algorithm doesn't have that excuse. If the current AIs in circulation can't function without being fed stolen works without credit or permission, then they're simply not ready for commercial use yet as far as I'm concerned. If that's never going to be possible, which I simply don't believe, then they should never be used commercially, period. And AI should be used by creators to assist in their work, not to replace them entirely. If it takes longer to develop, fine. If it takes more effort and manpower, fine. That's the price I'm willing to pay for it to be ethical. If it can't be done ethically, then IMO it shouldn't be done at all.
Your broader point would be stronger if it weren't framed around what seems like a misunderstanding of modern AI. To be clear, you don't need to believe that AI is "just" a "coded algorithm" to believe it's wrong for humans to exploit other humans with it. But to say that modern AI is "just an advanced algorithm" is technically correct in exactly the same way that a blender is "just a deterministic shuffling algorithm." We understand that the blender chops up food by spinning a blade, and we understand that it turns solid food into liquid. The precise way in which it rearranges the matter of the food is both incomprehensible and irrelevant. In the same way, we understand the basic algorithms of model training and evaluation, and we understand the basic domain task that a model performs. The "rules" governing this behavior at a fine level are incomprehensible and irrelevant, and certainly not dictated by humans. They are an emergent property of a simple algorithm applied to billions-to-trillions of numerical parameters, in which all the interesting behavior is encoded in some incomprehensible way.
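To make the "simple algorithm, emergent behavior" point concrete, here's a toy sketch (my own illustration, nothing to do with any real LLM): a two-parameter model trained by a dead-simple repeated update rule. No human writes the "rule" the trained model follows; it ends up encoded in the learned numbers.

```python
# Toy illustration: training is just a simple update rule applied over and
# over; the interesting behavior lives in the learned parameters, not in
# code a human wrote. (Real models do this with billions of parameters.)
import random

random.seed(0)
data = [(x, 3 * x + 1) for x in range(-5, 6)]  # hidden pattern: y = 3x + 1
w, b = 0.0, 0.0                                # the model's two "parameters"

for _ in range(2000):
    x, y = random.choice(data)
    pred = w * x + b          # model's guess
    err = pred - y            # how wrong it was
    w -= 0.01 * err * x       # nudge each parameter to reduce the error
    b -= 0.01 * err

# After training, w and b have drifted toward the hidden pattern (w≈3, b≈1),
# even though no line of code ever says "the rule is 3x + 1".
print(w, b)
```

Nobody "dictated" the final values of `w` and `b`; they emerged from data plus a mindless update rule, which is the blender point in miniature.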
I disagree with your interpretation of how an AI works, but I think the way that AI works is pretty much irrelevant to the discussion in the first place.
I think your argument stands completely the same regardless. Even if AI worked much like a human mind and was very intelligent and creative, I would still say that usage of an idea by AI without the consent of the original artist is fundamentally exploitative.
You can easily train an AI (with next to no human labor) to launder an artist's works, by using the artist's own works as reference. There's no human input or hard work involved, which is a factor in what dictates whether a work is transformative. I'd argue that if you can put a work into a machine, type in a prompt, and get a new work out, then you still haven't really transformed it. No matter how creative or novel the work is, the reality is that no human really put any effort into it, and it was built off the backs of unpaid and uncredited artists.
You could probably make an argument for being able to sell works made by an AI trained only on the public domain, but it still should not be copyrightable IMO, cause it's not a human creation.
TL;DR - No matter how creative an AI is, its works should not be considered transformative in a copyright sense, as no human did the transformation.
I thought this way too, but after playing with ChatGPT and Midjourney near daily, I have seen many moments of creativity that go way beyond the source material they were trained on. A good example I saw was in a YouTube video (sorry, I can't recall which one to link) where the prompt was animals made of sushi, and wow, it was so good and creative in how it made them, and it was photorealistic. This is just not something you can find anywhere on the Internet. I just did a search and found some hand-drawn Japanese-style sushi with eyes and such, but nothing like what I saw in that video.
I have also had it suggest ways to handle coding on my VR theme park app that are very unconventional and not something anyone has posted about, as near as I can tell. It seems to be able to put 2 and 2 together and get 8. Likely, since it sees so much of everything at once, it can connect the dots in ways we would struggle to. It is more than regurgitated data, and it surprises me near daily.
if it’s being done for profit, and fair use requires it to provide commentary, criticism, or parody of the original work used. Even if it’s transformative enough to make the original unrecognizable
Neural networks are based on the same principles as the human brain; they are literally learning in the same way humans do. Copyrighting the training of neural nets is essentially the same thing as copyrighting interpretation and learning by humans.
Well, I'd consider agreeing if LLMs were treated as a generic knowledge database. However, I had the impression that OpenAI & co.'s whole response to this copyright issue is "they build original content", both for LLMs and stable diffusion models. Now that they've started this line of defence, I think they're stuck with proving that their "original content" is not derived from copyrighted content 🤷
Copyright definitely needs to be stripped back severely. Artists need time to use their own work, but after a certain time everything needs to enter the public space for the sake of creativity.
If you sample someone else's music and turn around and try to sell it, without first asking permission from the original artist, that's copyright infringement.
So, if the same rules apply, as your post suggests, OpenAI is also infringing on copyright.
I think you completely and thoroughly do not understand what I'm saying or why I'm saying it. Nowhere did I suggest that I don't understand modern copyright. I'm saying I'm questioning my belief in this extreme interpretation of copyright, which is represented by exactly what you just parroted. This interpretation is both functionally and materially unworkable, and also antithetical to a reasonable understanding of how ideas and communication work.
A sample is a fundamental part of a song's output, not just its input. If LLMs change the input work to a high enough degree, is the result not protected as a transformative work?
To add to that, Harry Potter is the worst example to use here. There is no extra billion that JK Rowling needs to allow her to spend time writing more books.
Copyright was meant to encourage authors to invest in their work in the same way that patents do. If you were going to argue about the issue of lifting content from books, you should be using books that need the protection of copyright, not ones that don't.
I just don't know that I agree that this line of reasoning is useful. Who cares what it was meant for? What is it now, currently and functionally, doing?
I’m a huge proponent of expanding individual copyright to extreme amounts (an individual is entitled to own the rights and usage rights to anything they create and can revoke those rights from anyone), but not in favor of the same thing for corporations.
I hold the exact opposite view as you. As long as it’s a truly creative work (a writing, music, artwork, etc) then you own that specific implementation of the idea. Someone can make something else based on it, but you still own the original content.
LLMs and companies using them need to pay for the content in some way. This is already done through licensing in other parallels, and will likely come to AI quickly.
To be clear, I'm still working through my thinking on this, but it's been cooking for quite a while. I may not have all the words to express my meaning. For example, I think there are two routes to take in making my argument, one moral, the other technical. I'm not building on the morality of copyright, but focusing on the technical aspects of the limits of ideas.
I suppose I would ask you then about your views on authoritarianism. Because it seems to me that without an extremely authoritarian state, it would be basically impossible to enforce your view of copyright. Are you okay with that kind of pervasiveness?
Also, from a technical perspective, how do you propose this view of copyright be applied? This goes toward the broader point I'm coming to believe: it's not just that copyright laws are prima facie ridiculous, they are also technically almost unenforceable in their modern extremist interpretation without an extremely pervasive form of surveillance.
I think this brings up broader questions about the currently quite extreme interpretation of copyright. Personally, I don't think it's wrong to sample from or create derivative works from something that is accessible. If it's not behind lock and key, it's free to use. If you have a problem with that, then put it behind lock and key. No one is forcing you to share your art with the world.
Should we distinguish it though? Why shouldn't (and didn't) artists have a say in whether their art is used to train LLMs? Just like publicly displayed art doesn't grant permission to copy it and use it for other, unspecified purposes, it would be reasonable for the same to apply to AI training.
Just like publicly displayed art doesn't provide a permission to copy it and use it in other unspecified purposes
But it kinda does. If I see a van Gogh painting, I can be inspired to make a painting in the same style.
When "AI" "learns" from an image, it doesn't copy the image or even parts of the image directly. It learns the patterns involved instead, across many pictures. Then it uses those patterns to make new images.
Ah, but that's the thing. Training isn't copying. It's pattern recognition. If you train a model on "The dog says woof" and then ask the model "What does the dog say", it's not guaranteed to say "woof".
Similarly, just because a model was trained on Harry Potter, all that means is it has a good corpus of how the sentences in that book go.
Thus the distinction. Can I train on a comment section discussing the book?
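To make the "pattern statistics, not stored text" point concrete, here's a toy sketch (my own example, a bigram counter, vastly simpler than any real LLM): after training, the model holds word-pair counts, not the text, and "the dog says" has more than one plausible continuation.

```python
# Minimal sketch of training as pattern extraction: the "model" keeps
# counts of which word follows which, not a copy of the corpus itself.
from collections import Counter, defaultdict

corpus = "the dog says woof . the cat says meow . the dog runs ."
counts = defaultdict(Counter)

words = corpus.split()
for a, b in zip(words, words[1:]):
    counts[a][b] += 1  # record that word b followed word a

# Ask "what comes after 'says'?": both woof and meow are equally likely,
# so even this trivial model isn't guaranteed to echo the training text.
print(counts["says"].most_common())
```

The counts alone don't determine a unique output text, which is the distinction being drawn: what the model retains is the shape of the language, not the work itself.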
Good news, they already do! Artists can license their work under a permissive license like the Creative Commons CC0 license. If not specified, rights are reserved to the creator.
If I memorize the text of Harry Potter, my brain does not thereby become a copyright infringement.
A copyright infringement only occurs if I then reproduce that text, e.g. by writing it down or reciting it in a public performance.
Training an LLM from a corpus that includes a piece of copyrighted material does not necessarily produce a work that is legally a derivative work of that copyrighted material. The copyright status of that LLM's "brain" has not yet been adjudicated by any court anywhere.
If the developers have taken steps to ensure that the LLM cannot recite copyrighted material, that should count in their favor, not against them. Calling it "hiding" is backwards.
You are a human, you are allowed to create derivative works under the law. Copyright law as it relates to machines regurgitating what humans have created is fundamentally different. Future legislation will have to address a lot of the nuance of this issue.
Another sensationalist title. The article makes it clear that the problem is users reconstructing large portions of a copyrighted work word for word. OpenAI is trying to implement a solution that prevents ChatGPT from regurgitating entire copyrighted works using "maliciously designed" prompts. OpenAI doesn't hide the fact that these tools were trained using copyrighted works and legally it probably isn't an issue.
If Google took samples from millions of different songs that were under copyright and created a website that allowed users to mix them together into new songs, they would be sued into oblivion before you could say "unauthorized reproduction."
You simply cannot compare one single person memorizing a book to corporations feeding literally millions of pieces of copyrighted material into a blender and acting like the resulting sausage is fine because "only a few rats fell into the vat, what's the big deal"
Google crawls every link available on all websites to index and serve to people. That's a better example. It's legal, and it's up to the websites to protect their stuff.
Let's not pretend that LLMs are like people where you'd read a bunch of books and draw inspiration from them. An LLM does not think nor does it have an actual creative process like we do. It should still be a breach of copyright.
... you're getting into philosophical territory here. The plain fact is that LLMs generate cohesive text that is original and doesn't occur in their training sets, and it's very hard if not impossible to get them to quote back copyrighted source material to you verbatim. Whether you want to call that "creativity" or not is up to you, but it certainly seems to disqualify the notion that LLMs commit copyright infringement.
The powers that be have done a great job convincing the layperson that copyright is about protecting artists rather than publishers. That's historically inaccurate: copyright law was pushed by publishers who did not want authors keeping secondhand manuscripts of works they had sold to publishing companies.
I think a lot of people are not getting it. AI/LLMs can train on whatever they want, but when these LLMs are then used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour. Similar to how using copyrighted clips in a monetized video can get you a strike against your channel, but if the video is not monetized, the chances of YouTube taking action against you are lower.
Edit - If this were an open-source model available for use by the general public at no cost, I would be far less bothered by claims of copyright infringement by the model.
AI/LLMs can train on whatever they want, but when these LLMs are then used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour.
And does this apply equally to all artists who have seen any of my work? Can I start charging all artists born after 1990, for training their neural networks on my work?
Learning is not and has never been considered a financial transaction.
Actually, it has. The whole concept of copyright is relatively new, and corporations absolutely tried to prevent people who learned proprietary copyrighted information from using it in other places.
It's just that labor movements got such non-compete agreements thrown out of our society, or at least severely restricted, on humanitarian grounds. The argument is that a human being has the right to seek happiness by learning, and by using the proprietary information they learned to better their station. By the way, it took a lot of violent convincing before we had this.
So yes, knowledge and information learned is absolutely within the scope of copyright as it stands; it's only that the fundamental rights humans have override copyright. LLMs (and companies, for that matter) do not have such fundamental rights.
Copyright, by the way, is stupid in its current implementation, but OpenAI and ChatGPT don't get to escape it, IMO, just because they're "learning". We humans only get out of copyright because of our special legal status.
Ehh, "learning" is doing a lot of lifting there. These models "learn" in a way that is foreign to most artists. And that's ignoring the fact that humans are not capital. When we learn, we aren't building a form of capital; when models learn, they are only building a form of capital.
But wouldn't this training and the subsequent output be so transformative that being based on the copyrighted work makes no difference? If I read a Harry Potter book and then write a story about a boy wizard who becomes a great hero, anyone trying to copyright strike that would be laughed at.
How is it any different from someone reading the books, being influenced by them and writing their own book with that inspiration? Should the author of the original book be paid for sales of the second book?
Again, that depends on how similar the two books are. If I just change the names of the characters and the grammatical structure and then try to sell the book as my own work, I am infringing the copyright. If my book has a different story but its themes are influenced by another book, then I don't believe that is copyright infringement. Where exactly the line between infringement and non-infringement lies is not something I can say, and it's a topic for another discussion.
I think a lot of people are not getting it. AI/LLMs can train on whatever they want, but when these LLMs are then used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour.
Only in the same way that I could argue that if you've ever watched any of the classic Disney animated movies, then anything you ever draw for the rest of your life infringes on Disney's copyright, and that if you draw anything for money, the Disney animated movies you have seen in your life have been used in a money-making endeavor. This is of course ridiculous, and no one would buy that argument. But when you replace a human doing it with a machine doing essentially the same thing (observing and digesting a bunch of examples of a given kind of work, and producing original works of that general kind that meet a given description), suddenly it's different, for some nebulous reason that mostly amounts to creatives, who believed their jobs could not be automated away, trying to get explicit protection from their jobs being at least in part automated away.
They used to be a non-profit that immediately turned into a for-profit once their product was refined. They took a bunch of people's effort, whether training materials or the users who helped train the product, and then slapped a huge price tag on it.
Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?
Mostly because fuck corporations trying to milk their copyright. I have no particular love for OpenAI (though I do like their product), but I have great disdain for already-successful corporations that would hold back the progress of humanity because they didn't get paid (again).
There's a massive difference though between corporations milking copyright and authors/musicians/artists wanting their copyright respected. All I see here is a corporation milking copyrighted works by creative individuals.
The dream would be that they manage to make their own glorious free and open-source version, so that after a brief spike in corporate profit as corporations fire all their writers and artists, suddenly nobody needs those corps anymore, because EVERYONE gets access to the same tools. If everyone has the ability to churn out massive amounts of content without hiring anyone, that theoretically favors those who never had the capital to hire people to begin with, far more than those who did the hiring.
Of course, this stance doesn't really have an answer for any of the other problems involved in the tech, not the least of which is that there are bigger issues at play than just "content".
An LLM is not a person, it is a product. It doesn't matter that it "learns" like a human - at the end of the day, it is a product created by a corporation that used other people's work, with the capacity to disrupt the market that those folks' work competes in.
AI is the new fanboy following now that it's become official that NFTs are all fucking scams. They need a new technological god to push to feel superior to everyone else...
Training AI on copyrighted material is no more illegal or unethical than training human beings on copyrighted material (from library books or borrowed books, nonetheless!). And trying to challenge the veracity of generative AI systems on the notion that it was trained on copyrighted material only raises the specter that IP law has lost its validity as a public good.
The only valid concern about generative AI is that it could displace human workers (or swap out skilled jobs for menial ones) which is a problem because our society recognizes the value of human beings only in their capacity to provide a compensation-worthy service to people with money.
The problem is this is a shitty, unethical way to determine who gets to survive and who doesn't. All the current controversy about generative AI does is kick this can down the road a bit. But we're going to have to address soon that our monied elites will be glad to dispose of the rest of us as soon as they can.
Also, amateur creators are as good as professionals, given the same resources. Maybe we should look at creating content by other means than for-profit companies.
This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they've been doing for a while, which is why their AI models are known to be massively censored. I wouldn't call that 'hiding'. It's kind of hard to hide that it was trained on copyrighted material, since that's common knowledge, really.
What if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes all over the place? They didn't ingest the material directly, or knowingly.
Is reading a passage from a book actually a crime though?
Sure, you could try to regenerate the full text from quotes you read online, much like you could open a lot of video reviews and recreate larger portions of the original text, but you would not blame the video editing program for that, you would blame the one who did it and decided to post it online.
That's why this whole argument is worthless, and why I think that, at its core, it is disingenuous. I would be willing to bet a steak dinner that a lot of these lawsuits are just fishing for money, and the rest are set up by competitors trying to slow the market down because they are lagging behind. AI is an arms race, and it's growing so fast that if you got in too late, you are just out of luck. So, companies that want in are trying to slow down the leaders at best, and at worst they are trying to make them publish their training material so they can just copy it. AI training models should be considered IP, and should be protected as such. It's like trying to get the Colonel's secret recipe by saying that all the spices used have been used in other recipes before, so it should be fair game.
If training models are considered IP, then shouldn't we allow other training models to view and learn from the competition? If learning from other copyrighted IP is okay, why should training models be treated differently?
People are acting like ChatGPT is storing the entire Harry Potter series in its neural net somewhere. It's not storing or reproducing text in a 1:1 manner from the original material. Certain material, like very popular books, has likely been ingested tens of thousands of times, given how often it was reposted online (and therefore how often it appeared in the training data).
Just because it can recite certain passages almost perfectly doesn’t mean it’s redistributing copyrighted books. How many quotes do you know perfectly from books you’ve read before? I would guess quite a few. LLMs are doing the same thing, but on mega steroids with a nearly limitless capacity for information retention.
Nope, people are just acting like ChatGPT is making commercial use of the content. Knowing a quote from a book isn't copyright infringement; selling that quote is. Also, content doesn't need to be stored 1:1 somewhere to be infringement; that misses the point. If you're making money off a synopsis you wrote based on imperfect memory and in your own words, it's still copyright infringement until you sign a licensing agreement with JK. Even transforming what you read into a different medium, like a painting or poetry, can infringe the original author's copyrights.
Now mull that over and tell us what you think about modern copyright laws.
Just adding that, outside of Rowling, who I believe has a different contract than most authors due to the expanded Wizarding World and Pottermore, most authors cannot quote their own novels online, because that would be publishing part of the novel digitally, and that's a right they've sold to their publisher. The publisher usually ignores this as it creates hype for the work, but authors are careful not to abuse it.
Still kinda blows my mind how the most socialist people I know (fellow artists) turned super capitalist the second a tool showed an inkling of potential to impact their bottom line.
Personally, I'm happy to have my work scraped and permutated by systems that are open to the public. My biggest enemy isn't the existence of software scraping an open internet, it's the huge companies who see it as a way to cut us out of the picture.
If we go all copyright crazy on the models for looking at stuff we've already posted openly on the internet, the only companies with access to the tools will be those who already control huge amounts of data.
I mean, for real, it's just mind-blowing seeing the entire artistic community pretty much go full-blown "Metallica with the RIAA" after decades of making the "you wouldn't download a car" joke.
I don't get why this is an issue. Assuming they purchased legal copies of what it was trained on, then what's the problem? Like, really. What does it matter that it knows a certain book cover to cover or is able to imitate art styles, etc.? That's exactly what people do too. We're just not quite as good at it.
A copyright holder has the right to control who has the right to create derivative works based on their copyright. If you want to take someone's copyright and use it to create something else, you need permission from the copyright holder.
The one major exception is Fair Use. It is unlikely that AI training is a fair use. However this point has not been adjudicated in a court as far as I am aware.
This is so fucking stupid, though. Almost everyone reads books and/or watches movies, and their speech is developed from that. The way we speak is modeled after characters and dialogue in books. The way we think often comes from books. Do we track down what percentage of each sentence comes from which book every time we think or talk?
I am sure they have patched it by now, but at one point I was able to get ChatGPT to give me copyrighted text from books by asking for ever larger quotations. It seemed more willing to do this with books that are out of print.
Yeah, it refuses to give you the first sentence from Harry Potter now.
Which is kinda lame, you can find that on thousands of webpages. Many of which the system indexed.
If someone was looking to pirate the book there are way easier ways than issuing thousands of queries to ChatGPT. Type "Harry Potter torrent" into Google and you will have them all in 30 seconds.
If I'm not mistaken, AI work was just recently ruled NOT copyrightable.
So I find it interesting that an AI learning from copyrighted work is an issue even though what it generates will NOT be copyrightable.
So even if you generated some copy of Harry Potter you would not be able to copyright it. So in no way could you really compete with the original art.
I'm not saying that it makes it ok to train AIs on copyrighted art but I think it's still an interesting aspect of this topic.
As others probably have stated, the AI may be creating content that is transformative and therefore under fair use. But even if that work is transformative it cannot be copyrighted because it wasn't created by a human.
If you're talking about the ruling that came out this week, that whole thing was about trying to give an AI authorship of a work generated solely by a machine and having the copyright go to the owner of the machine through the work-for-hire doctrine. So an AI itself can't be an author or hold a copyright, but humans using one can still be copyright holders of any qualifying works.
How do you tell if a piece of work contains AI generated content or not?
It's not hard to generate a piece of AI content, put in some hours to round out the AI's signatures and common mistakes, and pass it off as your own. So in practice it's still easy to benefit from AI systems by masking generated content as largely your own.
That's not how copyright works. I'm embarrassed for you, and all the people who blindly upvoted you. Like... Yikes. What's happening to this world?
You can't publish someone else's copyrighted work as your own just because your copy of it isn't itself copyrightable. That's an open-and-shut case of copyright infringement. Why do I have to say this? Am I on candid camera?
Yes, but what it's doing with them is the murky grey area. Anyone can read a book, but you can't use those books for your own commercial stuff. Rowling and other writers are making the case that their works are being used commercially in an inappropriate way. Whether they have a case, I dunno (IANAL), but I could see the argument at least.
Harry Potter uses so many tropes and inspiration from other works that came before. How is that different? Wizards of the Coast should sue her into the ground.
Google's AI search preview seems to brazenly steal text from search results. Frequently its answers are word for word the same as one of the snippets lower on the page.
What the article is explaining is cliff notes or snippets of a story. Isn't that allowed in some respect? People post notes from school books all the time, and those notes show up in Google searches as well.
I totally don't know if I'm right, but doesn't copyright infringement involve plagiarism like copying the whole book or writing a similar story that has elements of someone else's work?
I don't know what's considered fair use here. But the point is it's taking words that aren't theirs, which will deprive websites of traffic because then people won't click through to the source article.
Yes but there's a threshold of how much you need to copy before it's an IP violation.
Copying a single word is usually only enough if it's a neologism.
Two matching words in a row usually isn't enough either.
At some point it is enough though and it's not clear what that point is.
On the other hand it can still be considered an IP violation if there are no exact word matches but it seems sufficiently similar.
Until now we've basically asked courts to step in and decide where the line should be on a case by case basis.
We never set the level of allowable copying to 0, we set it to "reasonable". In theory it's supposed to be at a level that's sufficient to, "promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries." (US Constitution, Article I, Section 8, Clause 8).
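The "how many matching words in a row" intuition can at least be measured mechanically. As a rough illustration (my own sketch, not how any court or detection system actually works), here is a naive way to find the longest run of consecutive words shared verbatim between two texts:

```python
def longest_shared_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive words that appears
    in both texts -- a crude measure of verbatim overlap."""
    wa, wb = a.lower().split(), b.lower().split()
    best = 0
    # Dynamic programming: prev[j] = length of the matching run
    # ending at wa[i-2] / wb[j-1] from the previous row.
    prev = [0] * (len(wb) + 1)
    for i in range(1, len(wa) + 1):
        cur = [0] * (len(wb) + 1)
        for j in range(1, len(wb) + 1):
            if wa[i - 1] == wb[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

original = "the boy who lived had a scar on his forehead"
suspect = "a boy who lived nearby had a small scar"
print(longest_shared_run(original, suspect))  # -> 3 ("boy who lived")
```

Of course, a single number like this can't capture the legal question: paraphrase with zero exact matches can still infringe, and long exact matches of common phrases may be fine, which is exactly why courts decide it case by case.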
Why is it that with AI we take the extreme position of thinking that an AI that makes use of any information from humans should automatically be considered to be in violation of IP law?
yes, but that's a different situation. with the LLM, the issue is that the text from copyrighted books is influencing the way it speaks. this is the same with humans.
The response from OpenAI, and the likes of Google, Meta, and Microsoft, has mostly been to stop disclosing what data their AI models are trained on.
That's really the biggest problem, IMO. I don't really care whether it's trained on copyrighted material or not, but I do want it to "cite its sources", so to speak.
One of the first things I ever did with ChatGPT was ask it to write some Harry Potter fan fiction. It wrote a short story about Ron and Harry getting into trouble. I never said the word McGonagall, and yet she appeared in the story.
There is enough non-copyrighted Harry Potter fan fiction out there that it would not need to be trained on the actual books to know all the characters. While I agree they are full of shit, your anecdote proves nothing.
It's a complicated question I'm unqualified to answer, but essentially there exists some sort of baseline, either for people or for how GPT usually responds, and then they can figure out which way the text "leans".
(edit 4 minutes in - hey I have this guy's album already ("Red Extensions of Me"))
I'm basically on the same page as this guy except I don't think the government has to manage a royalties system. People can handle that freely, no? Plus you can pretty immediately envision they're gonna have some kind of asinine censorship policy for what content is acceptable and what content isn't.
the government in its current form would have that flaw in the content distribution system, yes, but his main idea is that it would be open-source run, in the sense of "government of the people"
Stupid. No it isn't. Establishing legal precedent or, in countries that don't work on precedent, a preponderance of legal cases, prohibiting this practice is what is needed.
It will be even worse if you must pay for all data to train an AI because it will make the systems even more exclusive. Copyright as a law is incompatible with AI and the change must be to require models trained on controlled works to be provided free.
Our ancient legal system trying to lend itself to "protecting authors" is fucking absurd. AI is the future. Are we really going to let everyone take a shot suing these guys over this crap? It's a useful program and infrastructure for everyone.
Holding technology back for antiquated copyright law is downright absurd.
Edit: I want to add that I'm not suggesting copyright should be a free-for-all on your books or hard work, but rather that this is a computer program and a major breakthrough. In the same way that no one sues my brain for consumption when I read a book, I don't think we should sue an AI: it is not reproducing books, in the same manner that the many footnote websites about books do not reproduce a book by summarizing its content. With the contingency that this holds only until OpenAI has an event where its reputation has to be re-evaluated (i.e., this is subject to change if they start trying to reproduce books).
And we have determined that AI created work cannot be copyrighted - because it's not a person. Nobody's trying to claim that AI somehow has the rights of a person.
But reading a bunch of books and then creating new material using the knowledge gained in those books is not copyright infringement and should be not treated as such. I can take Andy Warhol's style and create as many advertisements as I want with it. He doesn't own the style, nobody does.
Why should that be any different for a company using AI? Makes no sense to me.
You have been duped into thinking copyright is protecting authors when really copyright primarily exists to protect companies like Disney.
I'm not sure about that at all. At what point does a computer program become intelligent enough to not have human rights but still have some cognition of fair use?
I think it needs to be really hashed out by someone who understands both copyright law and data warehouses, and some programming. It's a sparse field for sure but we need someone equipped for it.
Because I don't think it's as linear as you're describing it.
Lawyers get paid regardless and are willing to yet again fuck regular folk and strip us of more things. The internet was so much more fun before they showed up and started suing everybody and issuing DMCA takedowns.