High Court rules that Getty vs. Stability AI case can proceed (technollama.co.uk)
110 points by AndrewDucker on Dec 9, 2023 | 122 comments


Important to note this is not a ruling about copyright infringement: Stability has not yet even presented a defense on that subject. Instead they argued that although they are a UK-based company, no people or computers residing in the UK were involved in the development of Stable Diffusion, so a UK court has no jurisdiction on the matter. The judge didn’t find the evidence for this (which includes the CEO saying he asked around and is “confident” it’s true) conclusive enough, so things will move forward. It may turn out to be true later, but Getty will get a chance to present counter-evidence, and so on.


I truly don't see how it matters where the computers are. Is he saying they airgapped their corporation and never looked at any output? That their product never operated in the UK?

Bizarre.


You can infringe in the UK by serving copyrighted content to users in the UK regardless of where the actual training occurred. This take on copyright law is asinine and juvenile and I’m not surprised the judge struck it down.

Can you imagine if no piracy occurred in the US when users download pirated content from a server in the Philippines, and you could only handle the case under the Philippine legal system?


But that's the issue: they're not serving copyrighted content to users in the UK, because the outputs aren't derivative. The legal question is very narrow; it is about the training of the model.

There will be a separate trade mark question on the outputs, which wasn't part of the request to dismiss the case.

>Can you imagine if no piracy occurred in the US when users download pirated content from a server in the Philippines, and you could only handle the case under the Philippine legal system?

Your hypothetical is about actual distribution and communication to the public of copyright works; this is very different to what is at stake here.


I think we fundamentally disagree on the nature of the outputs. I think it's extremely rare that a neural network trained on copyrighted material will produce zero outputs that are derivative of some copyrighted work. It might take a bit of experimentation or searching to find the derivative outputs, but historically, for every one of these neural nets that has been sufficiently public to allow experimentation, we have found examples.


I don't think copyright has much meaning left in it anymore. Any work can be "extracted" into its elements, ideas and style, and recombined in million ways.

You could generate a billion images with SD and train the next model on them. Make sure they don't look too close to copyrighted works. Being AI generated, they have no copyright. You can still use real data as well if it is in the public domain.

If you do this enough the initial copyrighted dataset is going to be further removed from the model. The model can't reproduce a copyrighted work because it hasn't seen any of them during training.

But more importantly, this process strictly separates ideas from expression and trains only on ideas without copyrighted expression. If authors complain it means they want to own ideas and styles.

You can also use copyrighted works to train a classifier to rank the quality of training examples, and apply it to filter your synthetic data to be higher quality.

You can even train a RLHF model to say when two works are "close enough" to constitute an infringement, and double down on safety by ensuring you don't generate or use risky works.
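Concretely, a minimal sketch of the pipeline described above might look like this (all objects here - generator, quality_model, similarity_model - are hypothetical placeholders, not any real Stable Diffusion API):

    def build_synthetic_dataset(generator, quality_model, similarity_model,
                                prompts, min_quality=0.8, max_similarity=0.35):
        dataset = []
        for prompt in prompts:
            image = generator.sample(prompt)               # e.g. a diffusion model
            if quality_model.score(image) < min_quality:
                continue                                   # drop low-quality samples
            if similarity_model.nearest_score(image) > max_similarity:
                continue                                   # drop near-copies of known works
            dataset.append((prompt, image))
        return dataset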

That's why I was saying that I don't think copyright has much meaning left in it anymore. Knowledge wants to be free, it travels, shape-shifts and evolves. It does not belong to any one of us except if we keep it to ourselves.


My gut says that you're approaching this at the wrong level of abstraction.

If new tech circumvents the law, the law can easily be changed, so:

> If authors complain it means they want to own ideas and styles.

is "yes, and?"

Also, copyright as it currently exists (a legal construct) is supposedly to promote the arts. GenAI may obviate the economic need to promote the arts… but if art is to us as a peacock's tail, then it won't ever obviate the desire to promote (and protect) the arts. The expense of human labour may be the point.


> is "yes, and?"

It would be self-defeating for artists. The same tools developed to scan AI art for copyright infringement will trigger on their own works as well. All "side inspiration" will be revealed; it will have a chilling effect on freedom.

Do you want art to be made of little islands of copyright, where someone has staked their claim and nobody else is allowed to create? If they can own styles they can ban others from using those styles.

Let's explain it like this: in a filesystem, a user has ownership over a few files. But from now on, users will own whole extensions like ".json", owning everything matching "*.json" instead of just "myreport.json". They want a copyright wildcard, "*.©"; it's a power grab. They want copyright to be more like patents.


> The same tools developed to scan AI art for copyright infringement will trigger on their own works as well

Perhaps; frightened and angry people often can't see the consequences of the direction they're running in.

But: if the effort is the point of art, then "proof of work" is how that goes down, not your scenario. Filming the artist as they put oil to canvas etc.

> They want copyright to be more like patents.

Not patents, trademarks. Patents cover novel specific inventions for a few years, trademarks cover anything that's similar enough it might confuse a customer for as long as you maintain it.

And what I want is almost irrelevant, the question is the aggregate will of society. My preferences feel like they're negatively correlated with public opinion on the topics loud people discuss most.


> Being AI generated, they have no copyright.

Careful with this assumption. The US Copyright Office has said this. Some other jurisdictions may have similar statements. But few (any?) claims around this have been tested in court, and specifically there is still the risk that some jurisdictions may find the output to be infringing if the training data is.

It will take time for this to shake out.

I do mostly agree with you that enough doors will be left open that it will be possible to find paths that will effectively "work around" copyright in fairly significant ways, but it's not so open-and-shut what will be OK and what won't.


The most likely outcome seems to be a black hole event where new rules of copyright will be established.

I think if you replaced the computer with a human and walked through all the same training, you'd come to the conclusion that the human is recreating copyrighted works. So if you then send that human out to all these people, knowing what they know, they're creating infringement.

The hinge will be that the training data was never licensed by the model maker - the trainer, in the human-computer analogy.

The only thing limiting damages is that the scale of impact is unmanageable.


If so, it will not affect the progress of AI, because the AI companies will simply buy sufficient data. E.g. OpenAI's valuation alone means it could buy the largest publishers and image banks if it wanted.

What an outcome like that would do, however, would be to massively centralise the ability to train legal models.

At the same time you'd end up with a lot of "washing" of training data (a lot of "works for hire" that'd really be outputs of models trained on copyrighted works), and a lot of development moving to other countries.


Consider what a properly vetted model looks like. Besides obtaining copyright clearance, it'd also become a wiki-like source, combed through and cleaned up.

I'm pretty sure we are at the low-hanging-fruit stage, waiting for a Mechanical Turk-style effort to properly put a training set together.


You can't trust the output to be accurate yet; the resulting model would be worse than the original because of artefacts.


In smaller LLMs it has been demonstrated that purely synthetic datasets can make a model 5x more efficient. So a 1.5B model can score like a 7B model. The model is Phi-1.5 from Microsoft.

But you can also empower the LLM: it can use more tokens, chain of thought, multiple rounds of LLM inference, other specialized models, tools, code execution, search, and a human in the loop. Does that remind you of anything? Yes, "OpenAI GPTs" - they are the empowered LLM environments that can create data superior to the LLM alone.

The general recipe is LLM+something extra = smarter than LLM. That something extra is usually a simulation or code execution or some real world interaction. This generates LLM error examples and feedback to learn to fix them in the next iteration. It is targeted training data.
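As a rough sketch of that recipe, here is an "LLM + code execution" loop that turns real errors into targeted training data; `llm.complete` is a hypothetical stand-in for any LLM API, not a specific product:

    import subprocess, tempfile

    def generate_with_feedback(llm, task, max_rounds=3):
        prompt = "Write a Python script that " + task
        examples = []                                # (prompt, code, error) triples
        for _ in range(max_rounds):
            code = llm.complete(prompt)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
            result = subprocess.run(["python", f.name], capture_output=True, text=True)
            if result.returncode == 0:
                return code, examples                # verified-correct output
            examples.append((prompt, code, result.stderr))
            # the "something extra": feed the real-world error back in
            prompt += "\nThe previous attempt failed with:\n" + result.stderr + "\nFix it."
        return None, examples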


How does knowledge want to be free? We're talking about generating derivatives of other people's art. Instead of hiring an artist I can ask the model to generate art in the same style for much less. Is that right?


What I meant was that knowledge distillation from trained models is a very efficient process. That's why it "wants" to be free.

Knowledge is easy to scrape, transform and train on. Once it gets into the open datasets, all models trained on that data will inherit the skills.

In language, the most useful recent training data has been generated with GPT-4. Its abilities get transferred very efficiently to smaller models.

I think the same will happen with images. We're going to generate synthetic datasets that would replace the original copyrighted works, but be more efficient and diverse.
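For illustration, the distillation process amounts to something like this sketch, where `teacher` and `student` are hypothetical model objects rather than a specific library API:

    def distill(teacher, student, seed_prompts):
        # the teacher's abilities are captured as plain (input, output) pairs
        pairs = [(p, teacher.generate(p)) for p in seed_prompts]
        # once those pairs exist as a dataset, any model trained on them
        # inherits the skills - the sense in which knowledge "travels"
        student.fine_tune(pairs)
        return student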


Instead of hiring Banksy, I can ask a guy down the hall to draw me something in the style of Banksy.


The concern is commercial art, not celebrity art, where the artist trumps the art. The way I see it playing out is that potential clients will tell the generator what to emulate and even link to or upload some samples. As tools continue to improve, the market for professional artists is going to shrink. To add insult to injury, the very art they create is being used to take away their livelihoods. That does not strike me as fair.


You can already ask artist A to make art in the style of artist B. So if you ask a machine to do the same thing, it's almost certainly going to be ruled legal by the courts.

The concept of having copyright over a style is ridiculous anyway. Our courts should not have their time taken up by making subjective determinations of whether some art is in the style of another.


The training jurisdiction one is interesting. Are future companies going to exclusively train their models in copyright-lax countries?

Seems like jurisdiction would be based on the copyright of the allegedly infringed images, and UK-based users creating copyright infringing copies in the UK.

But that's apparently not the law or case law in the UK yet.


Like anyone stealing data or other digital goods in countries that don't respect property, they will be punished when trying to sell in countries where property is protected.


Still not convinced training AI is theft, and I think copyright law is a cudgel used by powerful corpos to extract rent and smash innovation.


While I somewhat agree with that take on copyright, I think you have to pick a lane to keep that position coherent:

Either you insist that copyright must be respected at every level, and the creators of material used for training deserve appropriate compensation, or

You throw out copyright completely in this context, but that means the resulting models cannot be treated as proprietary either unless they were produced using absolutely no unlicensed training data.

I think there is an argument for both. Want to create a proprietary model for commercial use? Pay up. Creating an open source, copyleft project exclusively for personal use and artistic expression? Exemption.

The current status quo is perfectly described by powerful corpos extracting rent. Billions for themselves and pennies for the average artist.


> Either you insist that copyright must be respected at every level, and the creators of material used for training deserve appropriate compensation

I don't think that current copyright law automatically entitles people to royalties from something like AI-generated imagery. The dichotomy you've presented here isn't pro-copyright vs anti-copyright, but "so pro-copyright that they argue for expanding the current laws" vs not.

> Want to create a proprietary model for commercial use? Pay up. Creating an open source, copyleft project exclusively for personal use and artistic expression? Exemption.

That definitely benefits all the "powerful corpos" you've mentioned here. Now, Disney, Adobe, Meta etc. can use a fraction of their money to get all the data they would ever need and be the sole profiteers, while all newcomers will face an impassable barrier to entry that prevents them from ever threatening the existing players.


Copyright already has a lot of limitations. It has never been, nor was it intended to be, absolute, because the point is to promote the arts, and too strict a grant of rights would stifle them instead. Indeed, in most places it is accepted that copyright is a significant limitation on the liberty of society at large, justified (or not, depending on one's opinions) by encouraging more works. But accepting it as a restriction means there is some degree of acceptance that it should not be more expansive than it needs to be (and many will disagree about whether the current length of copyright is or is not more expansive than it needs to be).

The only limitation that needs to exist for training on copyrighted works not to be infringing is to accept that extracting information about a work is not infringing if copyrighted elements of the work itself are not significantly reproduced.


There is at least one middle ground area where you acknowledge that copyright and intellectual property restrictions should be removed, but that we should also recognize that all of the existing work was created by artists who expected they would have copyright protection. We should in my view not take from artists without their consent, and there is no implied consent when their works were posted at a time they believed they were protected by copyright.

This would mean we have to do a few difficult and worthwhile things: explicitly dismantle the copyright system, encourage artists to donate their existing works to the commons, and then only make datasets based on legally collected information. This would also have the side effect of encouraging the development of new training techniques and model designs which are more sample efficient.

I am afraid that what we will do instead is allow some erosion of copyright for small creators without dismantling the power large intellectual property holders have over the rest of us.


> We should in my view not take from artists without their consent

I think "take" is the wrong word here, nobody is republishing the copyrighted works, instead the model gets a gradient update. The update is shaped exactly like the model itself, and it gets stacked up with other updates from other examples. It doesn't look like the original work at all, the original work was a picture or book, the gradients look like a set of floating point tensors. AI models decompose inputs into basic concepts, they don't copy like bittorrent.

Why should an AI not be allowed to form a full world model that includes all published works? It's not like the authors can use copyright to stop anyone from seeing their works, they never had a right to stop others from seeing.


I am more arguing that if it’s considered taking, we should follow the path I recommend.

Whether or not it is taking is more nuanced, but I will say I’m not sympathetic to the idea that it’s broadly similar to a human looking at the work. It’s just very, very different. You can’t spin up a copy of a human on a cloud server and make them work 24/7.

I would expect that as laypeople we aren’t equipped to reason about this effectively. I suspect that decades or more of case law would be relevant to how this would be viewed, and I’m personally not equipped to argue it.

What I do know is that artists don’t feel good about it. They feel like they’re being taken from. And I’m not inclined to quickly dismiss their concerns. I think this needs careful, deliberate consideration. And if a system could be built that is consent based, I’d feel much better about it. A human child could be raised and mature without ever being exposed to copyrighted material beyond a handful of books (harder in the modern world but common 200 years ago). Maybe we just need to build better models. It certainly seems possible.


If you can legally use something, it's not stealing.

Even the argument that it's logical to call it that isn't certain.

If I take your picture, do I own it?

If you take a picture and I upload it to FB, does Meta get to use it?

If I publish your book under my name and no one finds out, did I write it?

If no author can be found, may I read it?


> I think copyright law is a cudgel

It is interesting to me that we are finally seeing a case where many smaller, independent artists and creators are using these laws to assert their rights against the encroachment of the moneyed tech interests, and of course now all of the powerful corpos are singing another tune. Rules for thee, not for me.


Yep. That is the question. Anyone that immediately comes back with “it’s stealing!”, especially those that were confidently saying it when this first became an issue, long before they would’ve had any time to contemplate it deeply, are just proving that techies’ sense of transferable expertise is completely unfounded.


No it isn't, because 'stealing' is allowed.

There’s no question these neural networks and their output are derivative works. However being a derivative work isn’t enough to guarantee copyright infringement.

So, the only question is whether we are going to carve out an exception here or not. The idea that someone can use a VCR to copy live TV and let people watch it later came out of a court case, not copyright law. There are a lot of such exceptions, but getting one isn't guaranteed.


> There’s no question these neural networks and their output are derivative works.

In the two US cases we have any progress on so far, the established requirement for substantial similarity (as opposed to "dependent on" or such) has been upheld, with Judge Vince Chhabria specifically setting out that it'd "have to mean that if you put the Llama language model next to Sarah Silverman's book, you would say they're similar", and Judge William H. Orrick agreeing with the defendants that "plaintiffs cannot plausibly allege the Output Images are substantially similar or re-present protected aspects of copyrighted Training Images, especially in light of plaintiffs' admission that Output Images are unlikely to look like the Training Images".

The UK definition of derivative works is, to my understanding, narrower and specifically enumerated as opposed to the US's more open-ended definition.

The remaining area of doubt, assuming the above remains consistent, is over the transient copying that occurs during training.


> the transient copying that occurs during training.

I think this should be dismissed, as it is the same level of transience as the workings of the internet; you and your ISP, caching proxies, etc., all made a transient copy as part of the existing (legal) consumption of the works that the author has put online.

Unless the works were illegally copied for training - which cannot be true if the works were publicly available for viewing on the internet - this transient copying cannot be a valid infringement.


Doing something a little isn't the same as doing something a lot. You can walk into a restaurant and look at a menu for 5 minutes and then leave without issue, but try doing that same thing for 8 hours.

Downloading a single transient copy of some image once in the lifetime of a company is different from doing that same action a hundred times, once for each version of the network.


This case involves many examples of substantial similarity. Worse, it's precedent that generative AI doesn't necessarily avoid creating such examples.

Defendants can easily argue that being 1/10-millionth or whatever of the training set means their specific work is unlikely to show up in any specific example, but the underlying mechanism means it can be recreated.


The defendants will evidently claim transient copying.


I doubt these companies are constantly re-downloading the full training set rather than keeping it in a database somewhere.

Hard to argue keeping a copy of some copyrighted work indefinitely counts as transient.


> I doubt these companies are constantly re-downloading the full training set rather than keeping it in a database somewhere.

Precisely so that they can argue for transient copies, they don't keep terabytes of data stored.

>Hard to argue keeping a copy of some copyrighted work indefinitely counts as transient.

You're assuming that they're keeping the works indefinitely, which again is not the case.


> Precisely so that they can argue for transient copies, they don't keep terabytes of data stored.

Those kinds of legal workarounds rarely work.

They are dependent on persistent access, which gives them the equivalent benefit of keeping a persistent copy.


> There’s no question these neural networks and their output are derivative works.

A derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work (the underlying work).

There is absolutely no agreement that what neural networks do (as a rule) counts as such, so it is not at all correct to say "there is no question..."

If learning how to draw by watching other people draw makes everything you draw a derivative work, then perhaps you have a point.


The network in question recreated the exact content at issue in a specific instance. What happens in general isn't the issue; the problem comes from specific outputs.

For a neural network to be able to recreate a complex work with minimal prompting, it must encode that information and therefore be a derivative work.


There are some ironclad exceptions, but they would have to make it through the dysfunctional Congress.

The big one is recipes. Recipes under the current copyright regime in the US are considered non-copyrightable facts, which is why every cookbook and recipe blog has lots of copyrightable splash photos and personal anecdotes. Congress specifically doesn’t want grandmas getting sued for copying the recipe on the box.


> Congress specifically doesn’t want grandmas getting sued for copying the recipe on the box.

Recipes don't have a specific exception within the copyright law that Congress has carved out.

It is also not cut and dried. It basically boils down to facts not being copyrightable. So a list of ingredients and basic instructions (e.g. cooking time and temperature) won't be granted copyright protection.

But, the prose in the instructions can be copyrighted. So copying a whole recipe verbatim can be copyright infringement, but copying the list of ingredients and writing out the basic instructions is not.


Sounds like a job for LLMs - extract ingredients and steps, then verbalize it back in a completely different style.


But to what end? SEO-optimized recipe copy sites already exist and are so numerous that going to specific sites or books is now just a signal of reputability in a sea of trash.


Not sure what Congress has to do with a case in the UK.

Fair use is mostly a US concept; there is no such thing in the UK or most other countries.


It seems like UK and EU agree that you cannot copyright a recipe other than maybe the exact way it was written:

https://www.twobirds.com/en/insights/2020/uk/intellectual-pr...

https://www.copyright.eu/docs/protection-of-a-recipe/

Though you can patent novel methods of food production, which is also true in the US.

The root statement is still the same, legislatures can amend copyright laws as they wish if they really care. I don’t know that the UK parliament is exactly functioning well right now, but that’s my impression from across the pond.


> I don’t know that the UK parliament is exactly functioning well right now

In terms of ability to legislate, it works considerably better than the US Congress.

Up to you if you call that well-functioning.


You can only copyright the actual expression of a recipe as a literary work, but the functional aspect, the cake let's say, isn't copyrightable.


> There’s no question these neural networks and their output are derivative works.

Most generated content almost certainly isn’t derivative work by the standards of copyright law. It’s plainly obvious to anybody who’s read Frank Herbert’s books that he derived a lot of ideas from Isaac Asimov, but it’s equally obvious that Dune isn’t a derivative work of Foundation.

If I had some commercial interest in generative AI models, I’d be very happy that everybody is debating the copyright implications. Because copyright law is certainly going to favour the models. The biggest regulatory risk to them as far as I can tell is that they clearly don’t have section 230 protections, and I can’t imagine how that isn’t going to come crashing down around them rather soon.


If you run someone over, you can't defend yourself by saying 99.999% of the time you didn't run someone over. Most output being free of copyright issues isn't a defense if any output has those issues.

Specific examples of clear copyright infringement mean that output is a derivative work AND, by encoding enough information to recreate it, the underlying neural network must itself be a derivative work.


Derivative work has a specific meaning in copyright law: there has to be something of the original in the output, and that's not the case here. Otherwise every single owner of 5 billion images could sue you for your "cat at a cafe" Midjourney picture.

Judge Orrick in one of the US cases already called this idea "nonsense", his words.


Not all outputs are at issue here, but if ANY output is copyright infringement they have problems.

Specific and clear examples of derivative works are shown; therefore both those exact examples and the underlying neural network must be derivative works.


I think that if you stick to a definition of 'fair use' that allows the slurping of entire corpuses, then copyright doesn't have any teeth anymore.

If the license makes the data public viewing, like with websites, then slurp all you want. If the license forbids automated bulk processing, then stop whining about fair use and pay for a license that allows bulk processing.

"Out system actively uses every single byte of data to produce any output" is so obviously not the intention fair use clauses.


Fair use doesn't exist in the UK.


What is your justification for AI training not being theft?


If including a copyrighted work in an AI training corpus is theft because of its influence on some artificial neural net, then so is viewing it by a human being, whose memory is now somehow the property of the copyright holder, an absurd conclusion.


Well if we're concluding that training a neural net and a human mind forming memories is exactly the same thing, I'm looking forward to all their defenders agreeing that the neural nets should be held criminally responsible every time they generate an image deemed unlawful in certain jurisdictions...

Otherwise there's obviously a legally relevant distinction between a human mind, which is ascribed agency to decide if and how to use its memories of copyrighted material, and an information retrieval system which can't help but spit out transformations of parts of its inputs on demand (including lossy representations of the Getty watermark if it's fed enough Getty material, or an exact facsimile of an image if that's all it's trained on...).


There's a massive jump in that logic, which is basically equating a large neural net to being exactly the same as a singular human being. If you ask me that is clearly not the case. They operate in entirely different ways and have very different properties.


>There's a massive jump in that logic, which is basically equating a large neural net to being exactly the same as a singular human being.

It does not assume that.


Are you willing to give up this point of view the first time you fail a Turing test? It seems only fair.


Humans are not treated the same as machines and business ventures. Many things are illegal for the latter and are not thought crimes for the former.


Your premise that a neural net is equivalent to a human brain seems much more absurd.


You wouldn't look at a car.


Copying is not theft in general, as you don't take anything away from anybody.


The major corps and tech companies loved to claim otherwise for years as it suited their interests.


>>Copying is not theft in general as you don't take anything away from anybody

That's a perfect oxymoron.


Theft has specific meaning in law, and it's reserved for physical property (or unique digital assets and financial instruments like bonds).

Copyright uses infringement, which is not theft: it's non-rivalrous, and it contains a number of exceptions.


Fair use. (Even if the creator doesn't like that)


Training a model on another person's work shouldn't be considered theft, but the model also shouldn't be allowed to generate profits without compensating the owner of the training data, of course.

Someone needs to come up with a royalty structure for this stuff.


I think using data that you don't have the copyrights to train AI is theft.

That being said, Getty is hardly the paragon of goodwill considering they regularly steal from public domain databases, issue DMCA takedown requests of the stolen content from said databases, and then turn around to sell it to unwitting people for a subscription. They own none of the copyrights for what they are doing but have been allowed to get away with it.


> I think using data that you don't have the copyrights to train AI is theft.

There are public domain works you can use, and copyright doesn't protect ideas. It protects expression of ideas, so getting "just the ideas" without the expression is OK.


Right. Public domain is stuff that doesn't have exclusive IP rights. You can do with that what you want.

The problem is that "expression of ideas" in the realm of AI is akin to plagiarism by human standards, because it's a literal copying of the source material blended together. I couldn't literally recite the entire text of the Odyssey off the top of my head, but AI can, because it has the source material. We just tell it to do funny ha-ha things, so it's okay.


Have you only read books you own the copyright to?

What’s the legal distinction between you learning and AI learning?


If I regurgitated something I read in a copyrighted book without a proper license, that would also be theft; no distinction there.

I'm not distributing my brain. At least the same (but probably more restrictive) rules should apply to models: training is okay, but using and distributing should be limited by copyright.


Explaining anything publicly based on the understanding I got from reading books would be illegal following this logic. I'm not sure that's how it works.


They want to muddle the distinction between ideas and expression. You can't copyright ideas. Everyone is entitled to copy ideas.


It would not be illegal based on fair use (though you have to be careful there also), but if you try to regurgitate large portions of the book then it would be. And we do know that models regurgitate training material verbatim (Copilot).


Redistribution, and the scale of it.

Besides which, "learning" isn't a fair use exemption anyway.


Using that which belongs to others without their consent is theft. There isn't much to debate, unless of course you wish to benefit from that theft. Powerful corpos can train whatever they wish on the data they own. For instance, Microsoft could train its bots on Microsoft's source code instead of people's code, but that is not going to happen, because they are aware of the implications - the procedural generators are exactly that and nothing more. Meanwhile, we can't use their products without a license.

So if they decide to play this game, then so should we.


> Using that which belongs to others without their consent is theft

Are text snippets, thumbnails and site caches shown by search engines (on an opt-out basis) "theft"? If you draw a car, which you can do due to having seen many individually copyrighted car designs, are you stealing from auto manufacturers? Have I just committed theft by using a portion of your comment above as a quote?

I don't claim here that statistical model fitting inherently needs to be treated the same as the above examples, but rather use examples to show that the bar of "using" is far too broad.

Legally, copyright infringement in the US requires that the works are substantially similar and not covered by Fair Use. Morally, I believe that artificial scarcity, such as evergreening of medical patents, is detrimental and needs to be prevented wherever feasible - and wouldn't call any kind of copying/sharing/piracy "theft". The digital equivalent of theft is, for example, account theft where you're actually removing the object from the owner's possession.


Theft is the act of taking away someone else's property. "Using" (aka copying) the public data I create isn't theft, be it with my consent or without. It may be copyright infringement under certain conditions, but arguing that this infringement is stealing is like arguing that digital piracy and shoplifting are basically the same thing.


> Using that which belongs to others without their consent is theft.

Using publicly available information doesn’t require anybody’s consent.


That's because the word 'training' is doing all the heavy lifting here. Think of it as copying, compressing and storing all the copyrighted material in a database. Humans learn, humans train; computers encode data. You would never say ffmpeg learned a movie.


> You would never say ffmpeg learned a movie.

No, you wouldn't, but these diffusion models do way more than ffmpeg, and do qualitatively different things.

I am on the fence, but I lean towards the side where training an AI using existing works is not infringement, as long as the AI's output is (or can be) majority new works. For example, a poor training algorithm that merely repeats the training dataset (and cannot output new works) is infringing, while a different algorithm (such as the current Stable Diffusion one) that can output works that have never been made and are totally new does not infringe - after all, style and ideas are not infringing, and if the algorithm managed to extract those from the training set, all the better.


Majority new works is not a good enough standard. If any output is a direct reproduction of a copyrighted input, that output is copyright infringement whether it was intended or not. If the trainer of the model doesn't want to be sued for infringement, they are responsible for a robust safety mechanism that prevents it. If that safety mechanism isn't possible, then don't use copyrighted works if you have any possibility of directly reproducing them.


> If any output is a direct reproduction of a copyrighted input that output is copyright infringement

So by that standard, why isn't Photoshop a copyright infringement? You can use it to create a copy just the same.


Photoshop isn't inherently a copyright infringement, but producing an infringing image with Photoshop is still infringement. In much the same way, AI is not inherently infringement, but any production of infringing content by the AI is still infringement.


What’s the test for “has never been made and is totally new”?

If I look at a photo of Prince and then, using that image as a reference, create a new silkscreen painting, is that fair use or infringement?

Because the US Supreme Court ruled that the instance I referenced was infringement, as both images were used for magazine covers [0].

[0] https://www.nbcnews.com/news/amp/rcna64624


> What’s the test for “has never been made and is totally new”?

The existing copyright rulings are sufficient to determine this, and it has nothing to do with AI models.

You've already pointed out a case - if you use an AI to generate an image which has sufficient likeness to an existing one, then the AI portion is irrelevant to the ruling. You could've made that same image in Photoshop without AI, and should obtain the same ruling.

But in the above circumstance, the silkscreen used in the creation of the image does not itself infringe. Replace that silkscreen with an AI model, and nothing has changed.


> Think of it as copying, compressing and storing all the copyrighted material in a database.

But it isn’t. It’s just a series of vectors that point to a likely occurrence of the next word or pixel or bit in a sequence.
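A toy version of that claim, where `logits_fn` stands in for a hypothetical trained network: the model stores no passages, only parameters that turn a context into a distribution over the next token.

    import numpy as np

    def next_token(logits_fn, context_tokens):
        logits = logits_fn(context_tokens)    # one score per vocabulary entry
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                  # softmax: likelihoods, not lookups
        return int(np.argmax(probs))          # most likely next token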


You are trying to argue encoding semantics, but at the end of the day the "AI" was completely happy to recite Carmack's fast inverse square root, including the original comments, verbatim, word for word.

https://twitter.com/StefanKarpinski/status/14109710611816816...


With the way these AI models work, that data isn’t stored in a database though.

It's hard for people to understand this concept, but the fact that a model repeated some data verbatim is a happy coincidence (!) based solely on patterns of data that it has seen before.

I think people also have a hard time with how these models are trained. They are vacuuming up all sorts of data and learning from it by creating vectors that determine how follow-up data should be generated.

Sure, the original creators of this content aren’t being compensated or even recognized for it. I don’t have a good idea on how that should be handled.

For normal humans though, looking at art or reading a book, and later repeating some passage or drawing something from your own memory is not a crime. (Unless you’re sharing the DeCSS source code I guess…)

Slightly changing the topic here, but I do wonder what would happen if someone wrote a program called "Monkeys on Typewriters" that just iterated through various combinations of characters (or bits or pixels) and was able to recreate things verbatim.

Is that random happenstance copyright infringement?
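For illustration, that program is only a few lines, though the search is astronomically slow for anything longer than a few characters:

    import itertools, string

    def monkeys_on_typewriters(target, alphabet=string.ascii_lowercase):
        # enumerate every string, shortest first, until one matches the work
        for length in itertools.count(1):
            for combo in itertools.product(alphabet, repeat=length):
                candidate = "".join(combo)
                if candidate == target:
                    return candidate          # recreated the target by brute force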


> For normal humans though, looking at art or reading a book, and later repeating some passage or drawing something from your own memory is not a crime.

False, actually; memorizing a copyrighted work and reproducing it other than in conditions specifically excepted from copyright protection is a violation of the exclusive rights of the copyright holder to make copies.

Copyright doesn't just apply to mechanical copies which don't have a human brain in the middle of the process.


Reciting common text or common license elements and commentary isn't necessarily copyright infringement.


You would never say ffmpeg stole a movie either...


> The idea here is that Stable Diffusion "memorised" the Getty logo, and could place it on outputs on demand.

> This is no longer possible as far as I can tell.

Why would it no longer be possible?

Stability has exactly zero way of updating the weights once they have made them public. Is the suggestion that this was only possible on 1.4, and that 1.5 and XL don't have the issue?

...or somehow what the model was previously capable of doing is now no longer possible?

That seems like enormously unsubstantiated speculation.

We have proof from multiple independent studies that these models in general memorize a small percentage of their training data to the degree that reasonable reconstructions of the original training images can be recovered from the model.

There is, to my knowledge, no mitigation of this that has been implemented, or even conceptualized by either stability or anyone else.

Interesting read regarding the actual ruling the judge made, but this is some "opinion here" commentary, which seems out of place; what the court actually rules is the interesting part of this.


I've memorized the Getty image and could reproduce it for you. Should I be lobotomized?

I think we need an entirely new approach to copyright. You can't copyright a brain, and weights are closer to that than they are pirated digital archives.


Stable Diffusion isn't a human being. It's not even sentient. It's a system designed to synthesize images, and it was trained using images authored by human beings - without the rightsholders' consent. This has nothing to do with lobotomies. You have rights as a human that a piece of software does not have.

I don't understand why I keep seeing people roll stuff out like 'should I be lobotomized' as some sort of attempt to distract from this just being a dispute over copyright minimalism/maximalism.


And the people training Stable Diffusion shouldn't need the authors' consent to train on their works.


Copyright has always made a distinction between the copies you make in a person's head and the copies you make into another work generated by a person.


> I've memorized the Getty image and could reproduce it for you. Should I be lobotomized?

Can I download you via a URL?

If not this analogy is completely meaningless. You and stable diffusion are just too different.


> Can I download you via a URL?

You can download an LLM trained on my writings. And I have a speech model, too.

It won't be much longer until we have brain scans of people that are turned into some form of generative system.


I’m not speculating about that.

I'm saying: if you're covering a court case, don't go off on a tangent and start saying things which are misleading or false with regard to the case.

> I think…

This is precisely my point. Most people don't care what I think, or what you or the OP think, on this topic.

Opinions are thick and fast.

What people care about is what the court rules.


>I’m saying; if you’re covering a court case, don’t go off on a tangent and start saying things which are misleading or false with regard to the case.

This adds the caveat "as far as I can tell". It's a hedged claim that the capability to produce Getty outputs has been removed; it can be tested and amended if new information comes to light.

>What people care about is what the court rules.

Courts also rely on experts and expert testimony; the judge here is even citing a law professor's opinion. The idea that judges never rely on legal experts and commentators is strange.


The complaint is not about pure storage/memorization.

If it produced images identical or highly similar to the copyrighted work, it would be similar to you making a copy and giving it to someone without paying royalties.

The article also shows an example of it reproducing their trademark.

Edit: Rephrased to be more clear.


It's not a brain. It's more like a very esoteric zip file which can extract a "lossy" version of the copyrighted material it was trained on, and reproduce that in an uncountable number of combinations that are meaningful to humans.


> reproduce that in an uncountable number of combinations that are meaningful to humans

Then it is creative. It is doing meaningful work on top of the basic facts it learned. The facts and the style can't be copyrighted.


>There is, to my knowledge, no mitigation of this that has been implemented, or even conceptualized by either stability or anyone else.

I've tried with newer models, and you can't produce outputs with Getty logos; if you use "getty images" in prompts it won't work. Even if you type "Getty" you don't get the logo. I remember reading a tweet by Emad that any reference to Getty had been removed as of 1.5, but I can't find it anymore; maybe it's been deleted?

So yes, this is a thing - feel free to try it for yourself.



This is how a company whose failing business model is collapsing fast would act. I never liked stock photo sites since their inception - they always followed dark patterns, spamming the entire search results with keywords that had nothing to do with them. If you searched for something "free" they would show you images of something that was NOT free. But hey, "royalty free" is close enough, I guess?

They went on to sue Google Images - which embraced the spirit of the open web, where you could simply right-click and save any image - to stop it from doing so.

And now that AI can pretty much produce stellar quality images with just a single line of text input, they don't have anything else to do other than drag everyone into litigation.

I'm a photographer myself (Sony A7C2/24-70mm G-Master) and I have learned to accept that this is going to be the future. And though there will always be a market for real photographers, AI will take over the bulk of our jobs. And the programmer in me who spent ages trying to find the perfect background image for my website says that's not really a bad thing.


Initially, photography itself was considered uncopyrightable "because it just reproduces a natural image". But later, courts came around and considered subject selection, composition, and other elements to be creative.


Maybe I've got the completely wrong end of the stick here, but why isn't an AI model treated as a fact, given it's essentially a factual summary of the most likely bit sequences to occur given an input sequence?


This argument feels like arguing that it's a fact that the first Game of Thrones book consists of <this text>, thus <this text> (the entirety of the book) isn't copyrightable.

If the bit sequence is likely to occur because it's someone else's creative content (or part of it is)... that doesn't seem like it can be a 'fact' in the relevant manner.


What I'm wrangling with is this:

I agree that a particular sequence of words is copyrightable.

What I'm struggling with is that facts _about_ that corpus of text are not copyrightable. A simple fact could be that the word "bar" is the 5th word. The 6th word is "jazz". Etc.

A model is trained from these "facts" across many source documents. It is thus itself a derived 'fact' given a set of training inputs and parameters, so how could _that_ then be copyrighted?

Put another way - there's the origin text and then... is it turtles all the way down, and none of it can be copyrighted because it's all math and calculations derived from that?
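A literal sketch of those "facts": each record below is individually an uncopyrightable fact about the text, yet together they fully determine the original sequence.

    def positional_facts(text):
        # e.g. {5: "bar", 6: "jazz", ...} for the example above
        return {i: word for i, word in enumerate(text.split(), start=1)}

    facts = positional_facts("foo at the corner bar jazz played")
    assert facts[5] == "bar" and facts[6] == "jazz"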


Well, for one thing, creative works aren't "facts". If they were, you could simply pirate any movie, because "facts" can't be copyrighted.


Because copyright has always tried to balance the idea of 'facts' against the idea of 'derivative works', even when they're blatantly in conflict with one another.


Very predictable and unsurprising. Stability knew they needed a license for training on Getty's copyrighted images, given the presence of the watermark. OpenAI partnered with Shutterstock for DALL-E precisely to avoid this legal headache.

The fair use excuses here are absolutely weak in Stability's case and this will only end with a licensing deal being made.


>The fair use excuses here are absolutely weak in Stability's case and this will only end with a licensing deal being made.

There's no fair use in the UK, and this decision is preliminary on whether the case should proceed.


The equivalent concept is fair dealing https://www.gov.uk/guidance/exceptions-to-copyright


I know, I'm a lawyer. Fair dealing is closed-ended, and the TDM exception doesn't apply to StabilityAI as they're not a research institution; their best defence is that the training didn't take place in the UK, or that any copies were transient.



