Copyright violation is not stealing, and training is not copyright violation (it...

inglor_cz · 2025-10-26T15:56:58 1761494218

I think the concerning problem is when the LLM reproduces some copyrighted code verbatim, and the user doesn't even stand a chance to know it.

1gn15 · 2025-10-26T15:58:44 1761494324

Yes, but that's not what the grandparent comment was talking about.

isodev · 2025-10-26T16:29:45 1761496185

If I’m the grandparent comment, it was a big part of what I mean. Stolen/Unknown content goes in for training, verbatim or very close “inspired by” code comes out and there is no way to verify the source - “violation as a service”.

fluidcruft · 2025-10-26T18:01:10 1761501670

Verbatim dumping is one thing but otherwise this seems closer to the issue of plagiarism than copyright. If someone studies the Linux kernel and then builds a new kernel that follows some of the design decisions and idioms that's not really copyright infringement.

The bigger issue (spiritually anyway) seems to be the need to develop free software LLM tools the same way FSF needed to develop free compilers. That's what's going to keep users from being able to adapt and control their machines. The issue is more ecological that programmers equipped with LLM are likely much more productive at creating and modifying code.

Some of the rest seems more like saying that anyone who studies GCC internals is forever tainted and must write copyleft code for life which seems laughable to me. Again this is more a topic of plagiarism than copyright which are fairly similar but actually different and not as clear cut.

isodev · 2025-10-26T19:03:04 1761505384

> more a topic of plagiarism than copyright

You’re right, in the context of a technical legal interpretation they’re different. In the context of right or wrong, they amount to the same.

> anyone who studies GCC internals

LLMs are not a someone, they’re more like … the indigo printout of some text or design, you then use to make a scrapbook to be mass produced for profit. Very different situation.

When the AI bubble pops, I hope we will have some equalisation back to something more ethical.

fluidcruft · 2025-10-26T20:04:08 1761509048

I don't know... there's a pretty clear difference between copyright and a say a utility patent or trade secret. The right and wrong in FSF isn't about labor, it's about control over machines and ability to modify. Free software has never tried to control the community using patents and trade secrets and in general are rather hostile to them. In fact its fairly contemptuous of copyright and uses it from a purely utilitarian perspective. And frankly FSF is not opposed to commercial software. They're opposed to users being unable to modify machines and software that they are using. That's the core of the ethics. See the origins in that damn printer firmware RMS did battle with.

But I also disagree in general about LLMs. LLMs are statistical text models but the general concept of what if there were an "AI" that wasn't a LLM and was trained on open source software is the same at the end of the day. I think whether or not LLM are intelligent or equivalent to humans is a red herring. There's no reason to not consider the implications of machines that are indistinguishable or even superior to human programmers. Particularly if we're discussing ethics getting lost in implementation details seems like a distraction and then all the derived ethics gets thrown out after the next innovation.

tjr · 2025-10-26T19:44:06 1761507846

LLMs are not a someone

This is in line with my disagreement over the fair use rulings. Most people who published works that have been used to train AI systems, created those works and published them for other people to consume and benefit from, not for proprietary software systems to consume and benefit from. The existing licenses and laws did not account for this; nobody was anticipating it.

CamperBob2 · 2025-10-26T17:17:44 1761499064

When that happens, it's because the code was trivial enough to be compressed to a minuscule handful of bits... either because it literally is trivial, or because it's common enough to have become part of our shared lexicon.

As a society, we don't benefit from copyright maximalism, despite how trendy it is around here all of a sudden. See also Oracle v. Google.

thesz · 2025-10-26T18:43:19 1761504199

Quake's sqrt approximation is not trivial and is not common.

[1] https://www.reddit.com/r/programming/comments/oc9qj1/copilot...

CamperBob2 · 2025-10-26T18:47:38 1761504458

(Shrug) It's a math trick, documented by Abrash among others and very heavily discussed on forums such as this one. And it didn't originate in the Quake codebase. Like much IEEE754 hackery, it goes back to the father of IEEE754 himself, William Kahan.

Nobody benefits from a law that says that LLMs can't regurgitate the Quake sqrt() approximation. If that's what the law actually says, which it isn't.

blibble · 2025-10-26T17:47:24 1761500844

> it's already been ruled as fair use, multiple times

most countries don't have a concept of fair use

but they nearly all have copyright law

quantummagic · 2025-10-26T18:12:45 1761502365

That fact in itself is a worse injustice than anything the LLM companies are doing. At the very least, it should be open to use in reporting, parody, and critique. Having no concept of such fair-use is oppressive and stifling.

Hamuko · 2025-10-26T18:35:29 1761503729

Fair use is not the only way to allow critique and/or parody.

quantummagic · 2025-10-26T21:03:01 1761512581

What are you talking about? You can call it whatever you want, but it amounts to fair-use if you're allowed to use something for the purposes of critique and/or parody.

Hamuko · 2025-10-26T21:04:54 1761512694

That's not how it works. Specific carveouts in copyright are not the same thing as fair use.

quantummagic · 2025-10-26T21:11:37 1761513097

You're playing word games. The point is that any system that has no concept of fair-use, no allowance for reasonable usage of copyrighted works, except where explicitly granted by the copyright holder, is inherently unjust and stifling to free expression. How such allowances are specified in law, is irrelevant pedantry. What matters is that the allowances are afforded.

Hamuko · 2025-10-26T21:15:18 1761513318

I implore you to not use terms like "fair use" if you do not understand what they mean and don't care to find out. Because it is not a general concept like "free speech". It's a very specific legal doctrine.

quantummagic · 2025-10-26T21:27:09 1761514029

You're pretending the concept is difficult to understand. You're pretending that there is some protected definition of "fair use" that is more complicated than people being granted the right to use a copyrighted work irrespective of the copyright holder's wishes.

That's all i've been saying from the start, and your pedantry isn't helpful. It doesn't make you look smart. You haven't added any value or illumination to the conversation.

Hamuko · 2025-10-27T05:13:49 1761542029

Once again: that is not what the term means and you should not use it if you do not understand what it means. There is a very established definition for it, and for some reason you keep pretending there isn’t without a shred of evidence. Fair dealing also grants people "the right to use a copyrighted work irrespective of the copyright holder's wishes" and it's separate of fair use.

nradov · 2025-10-26T19:49:37 1761508177

Why should we care about conflicting IP laws in other countries? Most of them have no effective means of extraterritorial enforcement.

blibble · 2025-10-27T00:01:25 1761523285

presumably you want to be able to export your services to the 95% of the world's population lives outside of the reach of the united states regime

if not there's plenty of competitors that will

nradov · 2025-10-27T20:18:08 1761596288

Now adjust that percentage for profit potential. On average, customers within US legal jurisdiction tend to be the most profitable.

blibble · 2025-10-27T20:38:49 1761597529

> On average, customers within US legal jurisdiction tend to be the most profitable.

don't worry, the orange one is fixing that problem too

thesz · 2025-10-26T18:38:44 1761503924

One can train model with copyrighted code as it is fair use, fair enough.

Are there any rulings about use of code generated by model trained on copyrighted code?

I believe distinction is clear.

matheusmoreira · 2025-10-26T17:20:00 1761499200

Yeah, copyright infringement isn't stealing, copyright shouldn't even exist to begin with.

I just think it's especially asinine how corporations are perfectly willing to launder copyrighted works via LLMs when it's profitable to do so. We have to perpetually pay them for their works and if we break their little software locks it's felony contempt of business model, but they get to train their AIs on our works and reproduce them infinitely and with total impunity without paying us a cent.

It's that "rules for thee but not for me" nonsense that makes me reach such extreme logical conclusions that I feel empathy for terrorists.

wakawaka28 · 2025-10-26T19:13:54 1761506034

Your views are contradictory. Copyright shouldn't exist, but the businesses infringing on it are the bad ones?

>We have to perpetually pay them for their works and if we break their little software locks it's felony contempt of business model

You don't have to pay them, or break their restrictions.

>but they get to train their AIs on our works and reproduce them infinitely and with total impunity without paying us a cent.

You don't need to allow this either. Unfortunately open-source code is necessarily public.

>It's that "rules for thee but not for me" nonsense that makes me reach such extreme logical conclusions that I feel empathy for terrorists.

The way LLMs use code is fundamentally different from wholesale copying. If someone read your code and paraphrased it and tweaked it, it would be a completely new work not subject to the original copyright. At least it would be really hard to get a court to regard it as an infringement. This is like what LLMs do.

matheusmoreira · 2025-10-26T20:27:29 1761510449

> Your views are contradictory.

How is it contradictory? Tell it to the corporations who defend copyright for you and public domain fair use for themselves. If they were honest, they'd abolish copyright straight up instead of creating this idiotic caste system.

> Copyright shouldn't exist, but the businesses infringing on it are the bad ones?

Yes. Copyright shouldn't exist to begin with, but since it does, one would expect the corporations to work within the legal framework they themselves created and lobbied so heavily for. One would expect them to reap the consequences of their actions and be bound by the exact same limitations they seek to impose on us.

It is absolutely asinine to watch them make trillions of dollars by breaking their own rules while simultaneously pretendinf that nothing is happening and insisting that you mortal citizen must still abide by the same rules they are breaking.

The sheer dishonesty of it makes me sick to my core.

> If someone read your code and paraphrased it and tweaked it, it would be a completely new work not subject to the original copyright.

Derivative work.

I was once told that corporate programmers are warned by legal not to even read AGPLv3 source code, lest it subconsciously infect their thought processes and the final result. This is also the reason we have clean room reverse engineering where one team produces documentation and another uses it to reimplement the thing. Isolating minds from the copyrighted inputs is the whole point of it. All of this is risk management meant to disallow even the mere possibility that a derivative work was created in the process.

There is absolutely no reason to believe LLMs are any different. They are literally trained on copyrighted inputs. Either they're violating copyrights or we're being oppressed by these copyright monopolists who say we can't do stuff we should be able to do. Both cannot be true at the same time.

> At least it would be really hard to get a court to regard it as an infringement.

It's extremely hard to get a court to do anything. As in tens of thousands if not hundreds of thousands of dollars difficult. Nothing is decided until actual judges start deciding things, and to get to that point you need to actually go through the legal system, and to do that you need to pay expensive lawyers lots of money. It's the reason people instantly fold the second legal action is threatened, doesn't matter if they're right. Corporations have money to burn, we don't.

And that's assuming that courts are presided by honest human beings who believe in law and reason instead of political activist judges or straight up corrupt judges who can be lobbied by industry.

wakawaka28 · 2025-10-27T05:24:06 1761542646

>I was once told that corporate programmers are warned by legal not to even read AGPLv3 source code, lest it subconsciously infect their thought processes and the final result.

There are different views out there about this. If you literally just copy a piece of code and make stupid changes, it might be a derivative work. But this is not guaranteed. There are times when there is one idiomatic way to do a thing, so your code will necessarily be similar to other code in the world. That type of code is not copyrightable, even if it appears in a larger work that is copyrightable. A small amount of bog standard code resembling something in another project is not in and of itself evidence of infringement.

Corporations would rather not have to deal with unnecessarily similar code or deliberate copyright or patent infringement. So they generally tell you not to look at anything else.

You should read over the criteria for something to be copyrightable: https://guides.lib.umich.edu/copyrightbasics/copyrightabilit...

The biggest issue is that any individual part of a copyrighted work may not be copyrightable. If you dissect a large copyrighted work, it probably contains many uncopyrightable structures. For example, in a book, most phrases and grammatical structures are not copyrightable. The style is not copyrightable. In code, common boilerplate is probably not copyrightable. Please don't make your stuff bizarre to add to its originality though. We need to be able to read and understand it lol.

>There is absolutely no reason to believe LLMs are any different. They are literally trained on copyrighted inputs. Either they're violating copyrights or we're being oppressed by these copyright monopolists who say we can't do stuff we should be able to do. Both cannot be true at the same time.

People who learn to program by reading open-source code are also trained on copyrighted inputs. Copyright may need some rethinking to cope with the reality of LLMs. Unless it is proven that AI is spitting out unique and copyrightable blocks of code from other projects, it really isn't infringing. People do this type of shit all the time. Have you ever looked at Stack Overflow and copied a couple of lines from it? You probably infringed on someone's copyright.

Ultimately, even if you prove copyright infringement happened, you have basically no recourse unless you also prove damages. Since open-source code is public and given away for free, the only possible damage is generally in being deprived of contributions that might have resulted from direct usage of your code. But direct integration of your entire project might have been highly unlikely anyway. Like it or not, people can be inspired by your work in a way that can't be proven without their direct confession.

thesz · 2025-10-26T18:39:50 1761503990

  > copyright shouldn't even exist to begin with.

You then get trade secretes and guilds. Hardly an improvement.

yencabulator · 2025-10-27T15:56:10 1761580570

It sounds like you're confusing copyright with patents.

matheusmoreira · 2025-10-26T19:01:15 1761505275

Secrets? Just leak them, it only has to happen once. Guilds? Revoke their privileges and protections, and there's nothing they can do about it.

Absolutely an improvement. Information wants to be free. Stop criminalizing it and people will find a way to free it. And once it's out there it's over, there is no containing it.

wakawaka28 · 2025-10-26T19:18:41 1761506321

People want to be paid for their work. If you don't let them, they won't do the work. "Information" does not have a mind of its own.

Even when the idea of a thing is "out there" there is a lot of grunt work and special stuff that needs to be implemented to get the best outcomes. Nobody owes you that work for free. Regardless of what GPL copers say, it is very hard to make money with software without enforcing some access restrictions and IP. Open source is great when it works, but it does not work for most things nor is it at the leading edge for most things.

matheusmoreira · 2025-10-26T20:37:46 1761511066

Welp. Then don't do the work if you don't want to. Nobody's advocating for your enslavement.

wakawaka28 · 2025-10-26T21:09:06 1761512946

Copyright facilitates people getting paid for IP work. Without it the work would be much harder to justify.

matheusmoreira · 2025-10-26T21:27:37 1761514057

Then let it be unjustified, and let it stay undone.

Copyright is a functionally perpetual state granted monopoly on information, on numbers. A business model that depends on such a delusion should not even exist to begin with.

wakawaka28 · 2025-10-27T05:01:53 1761541313

Sorry, a piece of software is no more a bunch of numbers than a musical score or a novel. Intellectual property exists for valid reasons. It's clearly not delusional because we've made it work for hundreds of years.

Saying that intellectual property is founded on a delusion because someone can infringe on it or it's mere numbers is like saying your physical property rights are a delusion because the stuff is mere matter and can be manipulated by anyone. While true, such statements add absolutely nothing to the conversation. If people are willing to copy from others rather than make their own creations, there is value in the original works that is a direct result of someone's labor, not some immutable law of the universe.

matheusmoreira · 2025-10-27T11:50:50 1761565850

It's not the age of printing presses anymore. It's the 21st century, the age of information, the age of globally networked pocket supercomputers. Copying is so trivial that it's become normal. It happens every day at massive scales. People do not even realize they are doing it.

We're discussing this in a thread about AI laundering of copyrighted works, for god's sake. If you keep believing this delusion, you'll eventually watch it shatter right before your eyes.

Let's not waste time comparing bits of infomation to physical objects either. We can revisit this discussion when Star Trek replicators are invented.

wakawaka28 · 2025-10-27T13:20:55 1761571255

>It's not the age of printing presses anymore.

Actually printing presses are alive and well.

>It's the 21st century, the age of information, the age of globally networked pocket supercomputers. Copying is so trivial that it's become normal. It happens every day at massive scales. People do not even realize they are doing it.

This is a bunch of hand-waving. Information is still bought and sold, and protected by law. Dissemination by the Internet is not all that different from dissemination on paper or via radio waves. People who make illegal copies of media know what they are doing.

>We're discussing this in a thread about AI laundering of copyrighted works, for god's sake. If you keep believing this delusion, you'll eventually watch it shatter right before your eyes.

Again, the AI is doing something not all that different from what an intelligent human would do. The fact it is done by a machine is only marginally relevant, because we are getting into philosophical questions about how it works and so on. Even if AGI is achieved, I think copyright will be extended to include the output of the machines. But what might change is that the value of information goes down as it gets easier to produce.

>Let's not waste time comparing bits of infomation to physical objects either. We can revisit this discussion when Star Trek replicators are invented.

It's not a waste. Even in the case where replicators existed, the output of a replicator would cost something because energy is not free. I don't think it is free in Star Trek if memory serves me right. The replicator and the holodeck, being finite resources, must be allocated intelligently and fairly among the crew. Same for the physical space aboard the ship. If anyone was to be a pack rat with unlimited replicator access, they might flood an entire deck with trinkets.

Likewise, even though copying is easy, it still represents a theft from the producer of the information. We are only debating details of the mechanics of how it works, to decide whether AI actually copies enough to be infringing and what the damages might actually be. That is one big question. That's one line of questioning. The other is, can the output of an AI be copyrighted? I think the answer to that question is definitely yes.

isodev · 2025-10-26T16:26:54 1761496014

Not really, only a handful of authorities have weighed on that and most of them in a country where model providers literally buy themselves policy and judges.