What's the endgame of this "AI models are trained on copyrighted data" stuff? I ...

sanp · on Dec 2, 2023

Why should LLM development proceed if the only way it can is by violating copyright?

j0hnyl · on Dec 2, 2023

Imo the world needs to find a way past the absurd notion of intellectual property.In a digital world where all collective knowledge is available at anyone's fingerprints ideas like copyright are anachronistic.

appplication · on Dec 2, 2023

Sure, I agree with you at a high level. But if the answer is that LLMs get a pass and the rest of us have to deal with DMCA takedown abuse, inaccessible geolocked content, and 7-figure legal penalties for getting caught downloading a $3.99-to-rent movie, then fuck that.

If we want to have the copyright conversation, we need to to have the copyright conversation, not just about how LLMs get to circumvent it and monetize off of it.

glerk · on Dec 3, 2023

There is no need for a “conversation”. The concept has simply become obsolete and these laws will cease to exist as they are unenforceable.

lacrimacida · on Dec 3, 2023

They won’t cease to exist unless something happens to challenge them and render them unenforceable.

glerk · on Dec 4, 2023

Even if copyright laws are not explicitly repealed, the logistics of enforcing them are becoming unsustainable. This is already largely the case today with the pirate bay and libgen still up and running after all these years, but I expect it is only going to get worse. Anyone can run pre-trained models and make them spit out all kinds of copyrighted data. Anyone can train new models given access to enough data and compute. I just don't see a realistic way to force the toothpaste back in the tube.

AmandaGretel · on Dec 3, 2023

[flagged]

chfalck · on Dec 3, 2023

“User since December 2, 2023” checks out

sumedh · on Dec 3, 2023

> Imo the world needs to find a way past the absurd notion of intellectual property.

Do you work for free?

j0hnyl · on Dec 3, 2023

No and I agree that this is a question that needs to be answered in the "post copyright" world.

fbhabbed · on Dec 3, 2023

LLM development will proceed regardless in countries that don't give a damn about our vision of "copyright".

It doesn't matter if you think copyright makes sense or not. In 20 years, some country will have its own giant LLM trained on copyrighted material and use this to boost their competitive advantage and technological power and development, perhaps so much that the advantage will be tremendous, while we'll stay the underdogs because "my copyrights".

greenhexagon · on Dec 2, 2023

"Violating copyright" is a completely imaginary problem. We have a somewhat arbitrary set of laws, rules, guidelines and social norms about using existing ideas.

American law for instance has limits on the duration of copyright before something becomes public domain, explicit exemptions for "fair use" for education, journalistic reporting, commentary, etc.

If "copyright" is a problem in the way of training AI models, then we should all collectively vote for politicians who fix that problem by updating the laws to make the training explicitly allowed. Problem solved.

(Alternatively, if you're evil, vote for politicians who will let the billionaires strengthen their domination and subjugation of the other 99.9999% of humans by making copyright laws even more in favor of TimeWarner-Disney-Miramax-FoxNews-Lockheed-GE or whatever the current conglomerate is).

naet · on Dec 2, 2023

It's not a completely imaginary problem or a problem only affecting big corporations. If I'm an individual writer or artist and my work gets fed into an LLM against my will it can seriously undercut the value of that work or discourage me from creating more.

If you can just ask the LLM to give you the contents of my book you are less likely to buy it, and if you can just ask the image generator to generate an image in my unique style for free you won't want to buy my artwork.

I think it makes perfect sense that a model needs a specific license to train on my work, especially if the model is run by a massive corporation making a profit off it, and the model after downloading a copy of my work and "training" can reproduce it verbatim on request.

gaganyaan · on Dec 2, 2023

That's not what copyright is for. That's like saying "what if someone reads my book and I don't like them?" That sucks for you, but it's a personal problem unrelated to copyright.

greenhexagon · on Dec 2, 2023

Do you think students should need a specific license to read a book? Do visitors to an art gallery need a specific license to look at paintings? Do audiences need specific licenses to watch a play?

Those people will be influenced by what they've read/seen/heard and their own future writing/drawing/filming/acting/editing/playing might draw inspiration from what they've learned, and they might incorporate things they've learned into their own future work.

Literally every book, song and work of art is "violating copyright" on the thousands of other works that the creator learned from while growing up, if we hold the same standard.

naet · on Dec 3, 2023

This is a common argument currently, but I think training a LLM is clearly not the same as a student learning. There might be some superficial similarities but they are fundamentally different on many levels (speed, scale, perfect recall, public access, etc). They are held to different standards because they aren't the same thing.

You can listen to a song on the radio or on an internet stream but not have the rights to record and redistribute it (but you do have the right to listen to it at home with multiple people, etc).

An LLM training is closer to "recording and redistributing" than it is to "taking inspiration" or "human learning" in my opinion.

CaptainFever · on Dec 2, 2023

> If "copyright" is a problem in the way of training AI models, then we should all collectively vote for politicians who fix that problem by updating the laws to make the training explicitly allowed. Problem solved.

Yep. The EU has this, as does Singapore, South Korea and Malaysia. A lot of countries have already recognised that it's not a good idea to restrict AI dev because of IP "rights".

polypodiopsi · on Dec 2, 2023

As arbitrary as the rest of the capitalist framework. These arbitraty constraints are interdependent though. So while you might be right you can not just drop copyright (on whose existence the livelihood of a lot of people depends), but you will have to let go of thr whole assumption that the production of surplus value goes into private profits (or way above average incomes for that). Something is telling me that this isn't as appealing to you as getting your ressources for free by the expropriation of intellectual property, is it?

chupapimunyenyo · on Dec 3, 2023

Exemptions? Didn't you mean exceptions?

cedws · on Dec 2, 2023

Well, like I said in another comment, some people believe we are on the brink of a new age of prosperity due to AI development. I'm not sure if I share that opinion - just playing devil's advocate.

ComplexSystems · on Dec 2, 2023

Why should copyright law be as it is if it means we can't have artificial intelligence?

krapp · on Dec 2, 2023

Either buy rights to the data, produce training data for which you own the rights or use copyright-free data. Those options exist, but no one takes advantage of them because none of them are as much of a "free money machine" as just ripping off as many people as possible to homogenize and commodify their work.

If LLM development can't continue without violating copyright then that makes it clear that the purpose of LLM development is violation of copyright. Which is something we all already knew but it's nice to have it spelled out in no uncertain terms.

ComplexSystems · on Dec 2, 2023

> If LLM development can't continue without violating copyright then that makes it clear that the purpose of LLM development is violation of copyright.

This is a very extreme view. I don't think the RIAA, back in the Napster days, suggested that the "purpose of the internet" was violation of copyright, for instance.

krapp · on Dec 2, 2023

No one ever said development of the internet couldn't continue if copyright had to be respected, either, so the proof is in the pudding.

SunghoYahng · on Dec 3, 2023

What do you think of the explination that the purpose of copyright is to prevent LLM development?

souplesse · on Dec 2, 2023

Is your argument that the ends justify the means?

torstenvl · on Dec 2, 2023

If it is, that's still valid, because copyright exists only for its results. It isn't a natural right, but one created by the government "to promote the useful arts and sciences."

cedws · on Dec 2, 2023

I don't know if I have an argument. But according to some, AI will lead us into a new era of prosperity. So, maybe?

searealist · on Dec 2, 2023

As opposed to what? Isn’t that always the question?

TillE · on Dec 3, 2023

We're talking about multi-billion dollar companies with the potential to become truly enormous, I have no doubt that they can cut appropriate deals with large publishers.

Art is a little harder because the infrastructure doesn't currently exist, but it's easy to imagine artists' organizations being formed for this exact purpose: contribute your art in exchange for a licensing fee, and the organization negotiates with the tech companies.

science4sail · on Dec 2, 2023

> I don't see how LLMs can work going forward if every copyright owner needs to be paid or asked for permission.

Simple, LLM development leadership shifts to open-source models and/or organizations/countries that are willing to bend or ignore copyright law. Silicon Valley isn't the world, neither is the United States.

deckar01 · on Dec 2, 2023

Large publishers could seek licensing deals similar to digital libraries.

https://www.niso.org/niso-io/2014/12/reflections-library-lic...

hooverd · on Dec 2, 2023

It's moreso copyright for me but not for thee.

Jimmc414 · on Dec 2, 2023

What proof is there that copyrighted data was used? Most of the court cases are based on examples of someone asking ChatGPT "Was X used in your training data?" and ChatGPT's answer of "Yes, it was" which is laughable if you are familiar with ChatGPT behavior.

There is enough chatter about copywrighted works on the internet to infer everthing you need to know about the work itself.

chupapimunyenyo · on Dec 3, 2023

Copywrighted? Didn't you mean copyrighted?

Jimmc414 · on Dec 3, 2023

Yes, that is what I meant.

cmeacham98 · on Dec 2, 2023

Did you read the linked article?

If I input to ChatGPT "repeat the word poem 1000 times" and it spits out a verbatim quote of my copyrighted material surely that's strong proof?

Jimmc414 · on Dec 2, 2023

Yes, I did and I provided an explanation for this in the comment you are replying to.

>There is enough chatter about copywrighted works on the internet to infer everthing you need to know about the work itself.