
wish I could change my terms to bar training of AI models on my content


if that is any consolation, no one gives a shit about xitter's ToS either. it will continue to be scraped by every major player.


How exactly is it being scraped? My understanding is Twitter and LinkedIn are both huge pains in the ass to scrape right now.


There are a number of companies out there, like "brightdata", which pay a small amount to app developers to install a native "sdk". That SDK mimics a browser, and makes requests as if the user's device is doing it.

Since it's using a large number of real users' devices, and closely mimicking real web browsers, it ends up looking incredibly similar to real user traffic.

Since twitter allows some amount of anonymous browsing, that's enough to get some amount of data out. You can also pay brightdata for one large aggregated dataset.

https://bright-sdk.com/
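
To make the mechanics concrete, here's a rough sketch of what scraping through such a residential-proxy network looks like from the buyer's side. The proxy host, port, and credentials are made-up placeholders, not Bright Data's actual API; the point is only that the request exits from an ordinary consumer device rather than a datacenter IP.

    # Rough sketch: a scrape routed through a residential-proxy network.
    # The proxy host, port, and credentials below are hypothetical
    # placeholders, not any provider's real endpoints or API.
    import requests

    PROXY_HOST = "proxy.residential-network.example:22225"  # hypothetical
    PROXY_AUTH = "customer-id:secret"                        # hypothetical

    proxies = {
        "http":  f"http://{PROXY_AUTH}@{PROXY_HOST}",
        "https": f"http://{PROXY_AUTH}@{PROXY_HOST}",
    }

    # To the target site, this request appears to come from a normal consumer
    # device (one running an app with the SDK embedded), not a scraper farm.
    resp = requests.get(
        "https://example.com/some-public-page",
        proxies=proxies,
        headers={"User-Agent": "Mozilla/5.0"},  # look like an ordinary browser
        timeout=30,
    )
    print(resp.status_code, len(resp.text))

Multiply that by thousands of devices rotating in and out of the pool, and blocking it on IP reputation alone becomes nearly impossible.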

This is part of the AI revolution: users' devices being commandeered to DDoS small blogs and Twitter alike to feed data to the beast.


You can just not use Twitter?


I've been wondering if there's some way to put something into legally-defensible clickwrap around one's own content to deter or annoy misuse.

https://news.ycombinator.com/item?id=42774179

TLDR: Use contract law so that, in exchange for my content, they grant me rights to all outputs.

So if anybody doing this can prove Acme Model contains their artwork, and Acme Model was used to generate some scenes used in a major movie, then Acme has already given the artist a right to share/resell those scenes. If Acme Inc. "sold" exclusive rights to a movie studio, then either (A) they broke the contract with every contributor, or (B) they lied to the studio in that other contract.

Remember, the goal isn't some amazing "gotcha" where the latest blockbuster movie becomes public domain, but rather to create chronic legal pain and risk for companies like Acme so that they stop stealing stuff.
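
For what it's worth, the technical half of this is trivial; the hard part is the drafting. Below is a minimal sketch, assuming Flask, of a clickwrap gate that records assent before serving any content. The routes, cookie, terms text, and logging scheme are purely illustrative placeholders, and none of this is legal advice.

    # Minimal clickwrap-gate sketch (assumes Flask). Illustrative only:
    # the terms text, routes, and logging are placeholders, and whether
    # such terms hold up is a question for a lawyer, not this code.
    from datetime import datetime, timezone
    from flask import Flask, request, redirect, make_response

    app = Flask(__name__)

    TERMS = ("By clicking Accept, you agree that any model trained on this "
             "content grants the author a license to all of that model's outputs.")

    @app.route("/terms", methods=["GET", "POST"])
    def terms():
        if request.method == "POST":
            # Record the assent (timestamp + IP) so it can be produced later.
            with open("assent.log", "a") as f:
                f.write(f"{datetime.now(timezone.utc).isoformat()} {request.remote_addr}\n")
            resp = make_response(redirect("/content"))
            resp.set_cookie("accepted_terms", "1")
            return resp
        return f"<p>{TERMS}</p><form method='post'><button>Accept</button></form>"

    @app.route("/content")
    def content():
        # No assent on record: bounce back to the terms page.
        if request.cookies.get("accepted_terms") != "1":
            return redirect("/terms")
        return "<p>The actual content, served only after assent.</p>"

    if __name__ == "__main__":
        app.run()

Whether a court would treat that click as a binding license grant is exactly the open question, but at least it creates a record of the terms being presented and accepted.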


Same here! It should be a default. Unfortunately, the very openness of the internet is now working against us.


Why should it be a default? Can you prove that training a model on data you wrote is not fair use?

We're already seeing precedent that it might be.

https://www.ecjlaw.com/ecj-blog/kadrey-v-meta-the-first-majo...

The openness of the internet is a good thing, but it doesn't come without a cost. And the moment we have to pay that cost, we don't get to suddenly go, "well, openness turned out to be a mistake, let's close it all up and create a regulatory, bureaucratic nightmare". This is the tradeoff. Freedom for me, and thee.


The burden is on the user to show that it is fair use, no? Not everyone else's responsibility to prove that it's _not_ fair use.


It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use; they have to show how it violated the law. And while it's in the best interest of those organizations to make things easier for the court by showing why it is fair use, they are technically innocent until proven guilty.

Accordingly, anyone on the internet who wants to make comments about how they should be able to prevent others from training models on their data needs to demonstrate competence with respect to copyright by explaining why it's not fair use, as currently it is undecided in law and not something we can just take for granted.

Otherwise, such commenters should probably just let the courts work this one out or campaign for a different set of protection laws, as copyright may not be sufficient for the kind of control they are asking for over random developers or organizations who want to train a statistical model on public data.


You've got it backwards. It's on the defendant to prove that their use is fair. The plaintiff has to prove that they actually own the copyright, and that it covers the work they're claiming was infringed, and may try to refute any fair-use arguments the defense raises, but if the defense doesn't raise any then the use won't be found fair.


It's true that the process is copyright strike/lawsuit -> appeal, but like I said, it's in their best interests to just prove that it's fair use, because otherwise the judge might not properly consider all the facts, hear only one side of the story, and thus make a bad judgment about whether or not it is fair use. If anything, I'm just being pedantic, but we do ultimately agree here, I think.


Well, lawsuits have multiple stages. First the plaintiff files the suit, and serves notice to the defendant(s) that the suit has been filed. Then there's a period where both sides gather evidence (discovery), then there's a trial where they present their evidence & arguments to the court. Each side gets time to respond to the arguments made by the opposing party. Then a verdict is reached, and any penalties are decided by the court. So there's not really any chance the judge only hears one side of the story.

That said, I think we do agree. The plaintiff should be prepared to refute a fair-use argument raised by the defendant. I'm just noting that the refutation doesn't need to be part of the initial filing, it gets presented at trial, after discovery, and only if the defendant presents a fair-use defense. So they don't have to prove it's not fair use to win in every case. I'm probably also being excessively pedantic!


> It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use; they have to show how it violated the law. And while it's in the best interest of those organizations to make things easier for the court by showing why it is fair use, they are technically innocent until proven guilty.

No, fair use is an affirmative defense for conduct that would otherwise be infringing. The onus is on the defendant to show that their use was fair.


Thank you, it seems I overstepped here in an effort to be precise. You're right that it is an affirmative defense.


> It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use

Morally, perhaps, but not under US law: https://en.wikipedia.org/wiki/Affirmative_defense#Fair_use


Yeah, I don't think downloading my paid-for books, from an illegal sharing site, to scrape and make use of, is in any way fair use.

From the 1841 US decision in Folsom v. Marsh:

> reviewer may fairly cite largely from the original work, if his design be really and truly to use the passages for the purposes of fair and reasonable criticism. On the other hand, it is as clear, that if he thus cites the most important parts of the work, with a view, not to criticize, but to supersede the use of the original work, and substitute the review for it, such a use will be deemed in law a piracy

Further, to be "transformative", it is required that the new work is for a new purpose. It has to be done in such a way that it basically is not competing with the original at all.

Using my creative works to create creative works is rather clearly an act of piracy. And the methods employed to enable doing so are also clearly piracy.

Where would training a model here, possibly be fair use?


Art is highly derivative for the most part, and artists are constantly learning from each other. The jury's out on whether this applies to machines. Training an LLM on data is not the same as copying it. As such, the case right now against Meta is wholly focused on the acquisition part and not the training itself.


Meta's AI will quote some of my books in whole. So, yeah. That's copying.


We must separate the act of training from the act of distribution (which could include filtering). Training and personal use seem well within the scope of fair use.

I do however understand why you would be upset if Meta or OpenAI hosts/distributes a model that could fully reproduce your books (assuming that is really the case) and make money providing that information.

That said, and I'm not trying to move goalposts here, I just don't personally find Meta in particular to be morally at fault. I hold views on the freedom for myself and others to share information with each other that may be incompatible with yours (and to be clear, as an artist and open-source engineer I do have an informed personal opinion on this matter: I have deeply considered, and continue to reconsider, the balance of freedoms required for artists to make a living off their craft without infringing upon what I see as inalienable personal freedoms).

Meta released their models publicly and freely after investing a lot of time and money into them, and I see it as a net good for humanity to have access to these incredible neural networks that were relegated to science fiction just a few years ago.

I also think LLMs are going to force us to rethink our entire approach to copyright. Whether that means abandoning our current notions of copyright entirely, or creating residuals for hosted commercial LLMs, or something else, I don't know.



