OP's idea is about having a new GPL-like license with a "may not be used for LLM training" clause.
That an LLM is not allowed to produce copyrighted work (e.g. outright copies of works, or output that is too structurally similar) without a license for that work is probably already the law. The companies work around this via content filters. They probably also have checks during/after training that the model does not reproduce work that is too similar.
There are lawsuits about this pending if I remember correctly, e.g. with the New York Times.
The issue is that everyone is focusing on verbatim (or "too similar") reproduction.
LLMs themselves are compressed models of the training data. The trick is that the compression is highly lossy: it captures higher-order patterns instead of focusing on the first-order input tokens (or bytes). If you look at how, for example, any of the Lempel-Ziv algorithms work, they also contain patterns from the input and they also predict the next token (usually a byte in their case), except they do it with 100% probability because they are lossless.
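To make that concrete, here is a minimal LZ78-style sketch (the input string and names are made up for illustration): the compressor builds a dictionary of patterns it has already seen and emits (known-prefix, next-token) pairs, i.e. it "predicts" the continuation of a known pattern with certainty rather than with a probability.

    // Toy LZ78: a lossless compressor is also a pattern-based next-token model.
    // Each output pair is (index of longest known phrase, the literal next byte);
    // decompression reproduces that continuation with 100% certainty.
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::string input = "abababcabababc";      // illustrative input
        std::map<std::string, int> dict;           // phrase -> dictionary index
        std::vector<std::pair<int, char>> out;     // (prefix index, next byte)
        std::string w;                             // longest phrase matched so far
        int next_index = 1;
        for (char c : input) {
            std::string wc = w + c;
            if (dict.count(wc)) {
                w = wc;                            // keep extending a known pattern
            } else {
                out.push_back({w.empty() ? 0 : dict[w], c});
                dict[wc] = next_index++;           // learn the new, longer pattern
                w.clear();
            }
        }
        // (a real implementation would also flush any leftover phrase in w)
        for (const auto& p : out)
            std::cout << "(" << p.first << ",'" << p.second << "') ";
        std::cout << "\n";                         // later pairs reuse longer learned phrases
    }

An LLM does the analogous thing with fuzzy, higher-order patterns and a probability distribution over continuations instead of an exact dictionary lookup.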
So copyright should absolutely apply to the models themselves and if trained on AGPL code, the models have to follow the AGPL license and I have the right to see their "source" by just being their user.
And if you decompress a file from a copyrighted archive, the file is obviously copyrighted. Even if you decompress only a part. What LLMs do is another trick - by being lossy, they decompress probabilistically based on all the training inputs - without seeing the internals, nobody can prove how much their particular work contributed to the particular output.
But it is all mechanical transformation of input data, just like synonym replacement, just more sophisticated, and the same rules regarding plagiarism and copyright infringement should apply.
---
Back to what you said - the LLM companies use fancy language like "artificial intelligence" to distract from this so they can then use more fancy language to claim copyright does not apply. And in that case, no license would help because any such license fundamentally depends on copyright law, which they claim does not apply.
That's the issue with LLMs - if they get their way, there's no way to opt out. If there was, AGPL would already be sufficient.
I agree with your view. One just has to go into courts and somehow get the judges to agree as well.
An open question would be if there is some degree of "loss" where copyright no longer applies. There is probably case law about this in different jurisdictions w.r.t. image previews or something.
With LLMs, if you did the first in the past, then no matter what license you chose, your work is now in the second category, except you don't get a dime.
> Not just to the people I agree with, but to anyone who needs to use a computer.
Why not say "... but to the people I disagree with"?
Would you be OK knowing your code is used to cause more harm than good? Would you still continue working on a hypothetical OSS project which had no users other than, say, a totalitarian government in the Middle East which executes homosexuals? Would you be OK with your software being a critical, directly involved piece of code used, for example, to track, de-anonymize and profile them?
As for me, that's a risk I'm willing to accept in return for the freedom of the code.
I'm not going to deliberately write code that's LIKELY to do more harm than good, but crippling the potential positive impact just because of some largely hypothetical risk? That feels almost selfish - what would I really be trying to avoid, personally running into a feel-bad outcome?
I agree with the GP. While I wouldn’t be happy about such uses, I see the use as detached from the software as-is, given (assuming) that it isn’t purpose-built for the bad uses. If the software is only being used for nefarious purposes, then clearly you have built the wrong thing, not applied the wrong license. The totalitarian government wouldn’t care about your license anyway.
The one thing I do care about is attribution — though maybe actually not in the nefarious cases.
During the gold rush, it is said, the only people who made money were the ones selling the pickaxes. A"I" companies are ~selling~ renting the pickaxes of today.
(I didn't come up with this quote but I can't find the source now. If anything good comes out of LLMs, it's making me appreciate other people's work more and try to give credit where it's due.)
To be honest, I haven't looked at any statistics, but I imagine only a tiny few of those looking for gold found any and got rich; most either didn't find anything, died of illness or exposure, or got robbed. I just like the quote as a comparison. Updated the original comment to reflect that I haven't checked whether it's correct.
I recall a basics-of-law class saying that in some countries (e.g. the Czech Republic), open source contributors have the right to small compensation if their work is used for large financial benefit.
At some point, I'll have to look it up because if that's right, the billionaires and wannabe-trillionaires owe me a shitton of money.
If you want, I made a coherent argument about how the mechanics of LLMs mean both their training and inference are plagiarism and should be copyright infringement.[0] TL;DR: it's about reproducing higher-order patterns instead of copying word for word.
I haven't seen this argument made elsewhere, it would be interesting to get it into the courtrooms - I am told cases are being fought right now but I don't have the energy to follow them.
Plus, as somebody else put it eloquently, it's labor theft - we, working programmers, exchanged our limited lifetime for money (already exploitative) in a world with certain rules. Now the rules have changed, our past work has much more value, and we don't get compensated.
In a court of law you're going to have to argue that something is an expression instead of an idea. Most of what LLMs pump out is almost definitionally on the idea side of the spectrum. You'd basically have to show the courts verbatim code or class structure at the expressive level.
Thanks for the links, I'll read them in more detail later.
There are a couple of issues I see:
1) All of the concepts were developed with the idea that only humans are capable of certain kinds of work needed for producing IP. A human would not engage in highly repetitive and menial transformation of other people's material to avoid infringement if he could get the same or better result by working from scratch. This placed, throughout history, an upper limit on how protective copyright had to be.
Say, 100 years ago, synonym replacement and paraphrasing of sentences were SOTA methods to make copies of a book which don't look like copies without putting in more work than the original. Say, 50 years ago, computers could do synonym replacement automatically so it freed up some time for more elaborate restructuring of the original work and the level of protection should have shifted. Say, 10 years ago, one could use automatic replacement of phrases or translation to another language and back, freeing up yet more time.
The law should have adapted with each technological step up and according to your links it has - given the cases cited. It's been 30 years and we have a massive step up in automatic copying capabilities - the law should change again to protect the people who make this advancement possible.
Now with a sufficiently advanced LLM trained on all public and private code, you can prompt them to create a 3D viewer for Quake map files and I am sure it'll most of the time produce a working program which doesn't look like any of the training inputs but does feel vaguely familiar in structure. Then you can prompt it to add a keyboard-controlled character with Quake-like physics and it'll produce something which has the same quirks as Quake movement. Where did bunny hopping, wallrunning, strafing, circlejumps, etc. come from if it did not copy the original and the various forks?
Somebody had to put in creative work to try out various physics systems and figure out what feels good and what leads to interesting gameplay.
Now we have algorithms which can imitate the results but which can only be created by using the product of human work without consent. I think that's an exploitative practice.
2) It's illegal to own humans but legal to own other animals. The USA law uses terms such as "a member of the species Homo sapiens" (e.g. [0]) in these cases.
If the tech in question were not LLMs but remixing of genes (only using a tiny fraction of human DNA) to produce animals which are as smart as humans, have chimpanzee bodies, can be incubated in chimpanzee females, but are otherwise as sentient as humans, would (and should) it be legal to own them as slaves and use them for work? It would probably be legal by the current letter of the law, but I assure you the law would quickly change because people would not be OK with such overt exploitation.
The difference is that the exploitation by LLM companies is not as overt - in fact, many people refer to LLMs as AIs and use pronouns such as "he" or "she", indicating they believe them to be standalone thinking entities instead of highly compressed lossy archives of other people's work.
3) The goal of copyright is progress, not protection of people who put in work to make that progress possible. I think that's wrong.
I am aware of the "is" vs "should" distinction but since laws are compromises between the monopoly in violence and the people's willingness to revolt instead of being an (attempted) codification of a consistent moral system, the best we can do is try to use the current laws (what is) to achieve what is right (what should be).
And HN does its thing again - at least 3 downvotes, 0 replies. If you disagree, say why, otherwise I have to assume my argument is correct and nobody has any counterarguments but people who profit from this hate it being seen.
> programmer who actually do like the actual typing
It's not about the typing, it's about the understanding.
LLM coding is like reading a math textbook without trying to solve any of the problems. You get an overview, you get a sense of what it's about and most importantly you get a false sense of understanding.
But if you try to actually solve the problems, you engage completely different parts of your brain. It's about the self-improvement.
We've been hearing this a lot, but I don't really get it. A lot of code, most probably, isn't even close to being as challenging as a maths textbook.
It obviously depends a lot on what exactly you're building, but in many projects programming entails a lot of low intellectual effort, repetitive work.
It's the same things over and over with slight variations and little intellectual challenge once you've learnt the basic concepts.
Many projects do have a kernel of non-obvious innovation, some have a lot of it, and by all means, do think deeply about these parts. That's your job.
But if an LLM can do the clerical work for you? What's not to celebrate about that?
To make it concrete with an example: the other day I had Claude make a TUI for a data processing library I made. It's a bunch of rather tedious boilerplate.
I really have no intellectual interest in TUI coding and I would consider doing that myself a terrible use of my time considering all the other things I could be doing.
The alternative wasn't to have a much better TUI, but to not have any.
> It obviously depends a lot on what exactly you're building, but in many projects programming entails a lot of low intellectual effort, repetitive work.
I think I can reasonably describe myself as one of the people telling you the thing you don't really get.
And from my perspective: we hate those projects and only do them if/because they pay well.
> the other day I had Claude make a TUI for a data processing library I made. It's a bunch of rather tedious boilerplate. I really have no intellectual interest in TUI coding...
From my perspective, the core concepts in a TUI event loop are cool, and making one only involves boilerplate insofar as the support libraries you use expect it. And when I encounter that, I naturally add "design a better API for this" to my project list.
Historically, a large part of avoiding the tedium has been making a clearer separation between the expressive code-like things and the repetitive data-like things, to the point where the data-like things can be purely automated or outsourced. AI feels weird because it blurs the line of what can or cannot be automated, at the expense of determinism.
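As a concrete (made-up) illustration of that separation: in a TUI, the key-to-action wiring is the data-like part that can live in a plain table, while the small event loop stays as the expressive, code-like part. A minimal sketch, with hypothetical bindings:

    // Hypothetical sketch: the repetitive "data-like" part (key bindings) lives in
    // a declarative table; the expressive "code-like" part is one small loop.
    #include <cstdlib>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>

    int main() {
        // Data-like: trivially extended, generated, or outsourced.
        const std::map<char, std::pair<std::string, std::function<void()>>> bindings = {
            {'q', {"quit",    []{ std::cout << "bye\n"; std::exit(0); }}},
            {'r', {"refresh", []{ std::cout << "refreshing\n"; }}},
            {'?', {"help",    []{ std::cout << "q: quit, r: refresh\n"; }}},
        };
        // Code-like: the actual event loop, worth designing by hand.
        char key;
        while (std::cin >> key) {
            auto it = bindings.find(key);
            if (it != bindings.end()) it->second.second();
            else                      std::cout << "unbound key: " << key << "\n";
        }
    }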
And so in the future if you want to add a feature, either the LLM can do it correctly or the feature doesn’t get added? How long will that work as the TUI code base grows?
At that point you change your attitude to the project and start treating it like something you care about, take control of the architecture, rewrite bits that don't make sense, etc.
Plus the size of project that an LLM can help maintain keeps growing. I actually think there may no longer be any realistic limit at all: the tricks Claude Code uses today with grep and sub-agents mean there's no practical upper bound on how much code it can help manage, even with Opus's relatively small (by today's standards) 200,000 token limit.
I've also been hearing variations of your comment a lot, and correct me if I am wrong, but I think they always implicitly assume that LLMs are more useful for the low-intellectual stuff than for solving the high-intellectual core of the problem.
The thing is:
1) A lot of the low-intellectual stuff is not necessarily repetitive; it involves some business logic which is the culmination of knowing the process behind what the user needs. When you write a prompt, the model makes assumptions which are not necessarily correct for the particular situation. Writing the code yourself forces you to notice the decision points and make more informed choices.
I understand your TUI example and it's better than having none now, but as a result anybody who wants to write "a much better TUI" now faces a higher barrier to entry since a) it's harder to justify an incremental improvement which takes a lot of work b) users will already have processes around the current system c) anybody who wrote a similar library with a better TUI is now competing with you and quality is a much smaller factor than hype/awareness/advertisement.
We'll basically have more but lower quality SW and I am not sure that's an improvement long term.
2) A lot of the high-intellectual stuff ironically can be solved by LLMs because a similar problem is already in the training data, maybe in another language, maybe with slight differences which can be pattern matched by the LLM. It's laundering other people's work and you don't even get to focus on the interesting parts.
> but I think they always implicitly assume that LLMs are more useful for the low-intellectual stuff than solving the high-intellectual core of the problem.
Yes, this follows from the point the GP was making.
The LLM can produce code for complex problems, but that doesn't save you as much time, because in those cases typing it out isn't the bottleneck, understanding it in detail is.
> LLM coding is like reading a math textbook without trying to solve any of the problems.
Most math textbooks provide the solutions too. So you could choose to just read those and move on and you’d have achieved much less. The same is true with coding. Just because LLMs are available doesn’t mean you have to use them for all coding, especially when the goal is to learn foundational knowledge. I still believe there’s a need for humans to learn much of the same foundational knowledge as before LLMs otherwise we’ll end up with a world of technology that is totally inscrutable. Those who choose to just vibe code everything will make themselves irrelevant quickly.
I haven't used AI yet but I definitely would love a tool that could do the drudgery for me for designs that I already understand. For instance, if I want to store my own structures in an RDBMS, I want to lay the groundwork and say "Hey Jeeves, give me the C++ syntax to commit this structure to a MySQL table using commit/rollback". I believe once I know what I want, futzing over the exact syntax for how to do it is a waste of time. I heard C++ isn't well supported but eventually I'll give it a try.
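For what it's worth, the transaction pattern itself is short. Here's a rough sketch using the JDBC-style MySQL Connector/C++ API - the table name, columns, function name and connection details are all placeholders, not something from a real project:

    // Rough sketch: insert a record inside an explicit transaction,
    // committing on success and rolling back on error.
    // Uses the JDBC-style MySQL Connector/C++ API.
    #include <memory>
    #include <string>
    #include <cppconn/connection.h>
    #include <cppconn/exception.h>
    #include <cppconn/prepared_statement.h>
    #include <mysql_driver.h>

    void save_thing(const std::string& name, int value) {
        sql::mysql::MySQL_Driver* driver = sql::mysql::get_mysql_driver_instance();
        std::unique_ptr<sql::Connection> con(
            driver->connect("tcp://127.0.0.1:3306", "user", "password"));  // placeholders
        con->setSchema("mydb");
        con->setAutoCommit(false);                 // we manage the transaction ourselves
        try {
            std::unique_ptr<sql::PreparedStatement> stmt(
                con->prepareStatement("INSERT INTO things (name, value) VALUES (?, ?)"));
            stmt->setString(1, name);
            stmt->setInt(2, value);
            stmt->executeUpdate();
            con->commit();                         // all or nothing
        } catch (const sql::SQLException&) {
            con->rollback();                       // undo the partial write
            throw;
        }
    }

The boilerplate is exactly the kind of thing that's tedious to retype for every struct, which is the appeal of having a tool generate it.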
> It's not about the typing, it's about the understanding.
Well, it's both, for different people, seemingly :)
I also like the understanding and solving something difficult; that rewards a really strong part of my brain. But I don't always like to spend 5 hours doing so, especially when I'm doing it because of some other problem I want to solve. Then, ideally, I just want it solved.
But then other days I engage in problems that are hard because they are hard, and because I want to spend 5 hours thinking about them, designing the perfect solution, and so on.
Different moments call for different methods, and particularly people seem to widely favor different methods too, which makes sense.
> LLM coding is like reading a math textbook without trying to solve any of the problems. You get an overview, you get a sense of what it's about and most importantly you get a false sense of understanding.
Can be, but… well, the analogy can go wrong both ways.
This is what Brilliant.org and Duolingo sell themselves on: solve problems to learn.
Before I moved to Berlin in 2018, I had turned the whole Duolingo German tree gold more than once, when I arrived I was essentially tourist-level.
Brilliant.org, I did as much as I could before the questions got too hard (latter half of group theory, relativity, vector calculus, that kind of thing); I've looked at it again since then, and I get the impression the new questions they added were the same kind of thing that ultimately turned me off Duolingo: easier questions that teach little, padding out a progression system that can only be worked through fast enough to learn anything if you pay a lot.
Code… even before LLMs, I've seen and worked with confident people who had a false sense of understanding of the code they wrote. (Unfortunately for me, one of my weaknesses is the politics of navigating such people.)
Yeah, there's a big difference between edutainment like Brilliant and Duolingo and actually studying a topic.
I'm not trying to be snobbish here, it's completely fine to enjoy those sorts of products (I consume a lot of pop science, which I put in the same category) but you gotta actually get your hands dirty and do the work.
It's also fine to not want to do that -- I love to doodle and have a reasonable eye for drawing, but to get really good at it, I'd have to practice a lot and develop better technique and skills and make a lot of shitty art and ehhhh. I don't want it badly enough.
Lately I've been writing DSLs with the help of these LLM assistants. It is definitely not vibe coding as I'm paying a lot of attention to the overall architecture. But most importantly my focus is on the expressiveness and usefulness of the DSLs themselves. I am indeed solving problems and I am very engaged but it is a very different focus. "How can the LSP help orient the developer?" "Do we want to encourage a functional-looking pipeline in this context"? "How should the step debugger operate under these conditions"? etc.
I spent 10 years writing open source; I haven't touched it in the last 2. I wrote it for multiple reasons, none of which apply any longer:
- I believe every software project should have an open source alternative. But writing open source now means useful patterns can be extracted and incorporated into closed source versions _mechanically_ and with plausible deniability. It's ironically worse if you write useful comments.
- I enjoyed the community aspect of building something bigger than one person can accomplish. But LLMs are trained on the whole history and potentially forum posts / chat logs / emails which went into designing the SW too. With sufficiently advanced models, they effectively use my work to create a simulation of myself and other devs.
- I believe people (not just devs) should own the product they build (an even stronger protection of workers against exploitation than copyright). Now our past work is being used to replace us in the future without any compensation.
- I did it to get credit. Even though it was a small motivation compared to the rest, I enjoyed everyone knowing what I accomplished and I used it during job interviews. If somebody used my work, my name was attached to it. With LLMs, anyone can launder it and nobody knows how useful my work was.
- (not solely LLM related) I believed better technology improves the world and quality of life around me. Now I see it as a tool - neutral - to be used by anyone for both good and bad purposes.
Here's[0] a comment where I described why it's theft based on how LLMs work. I call it higher order plagiarism. I haven't seen this argument made by other people, it might be useful for arguing about those who want to legalize this.
In fact, I wonder if this argument has been made in court and whether the lawyers understand LLMs enough to make it.
I think about better voting systems all the time (one major issue being that a downvote can mean "I want fewer people to see this", "I disagree", or "This is factually wrong", and you never know which).
But I am not sure if SO's is actually that good, given it led to this toxic behavior.
I think something like Slashdot's metamoderation would work best, but I never participated there nor have I seen any other website use anything similar.
Ars Technica used to have different kinds of upvotes for "funny" vs "insightful" - I forget exactly what all of them were. But I found it awesome: I wanted to and could read the insightful comments, not the funny ones. A couple of years back they redid the discussion system and got rid of it. Since then the quality of discussion has IMHO completely tanked.