
I wish we had a constitutional amendment that open-sourced all commercial AI models and required documentation of, and links to, all training data and base prompts.

They are trained on public data at our expense so We The People should *own* them.

Someday, probably sooner than we might think, we'll easily run mega-huge models on our laptops, desktops, and phones. AI should be free. Right now it's overhyped and overpriced. I would love this setup for privacy and security.

Anyways, this is only tangentially related... (why worry about leaks like this and the hidden base prompts? They *should all be 100% OSS*; it is the only way to ensure privacy and security).

Also, long-time lurker, first time posting!

I just had to get this off my mind! Cheers.



What you are describing is more-or-less a planned economy, the polar opposite of America's market economy. The government has the power to appropriate things for the common good because it's perceived that private enterprise isn't a necessary force. Sometimes it works, sometimes it doesn't; only certain countries can "moneyball" their way through economics like that, though. America has long since passed the point of even trying.

Your heart is in the right place here (I agree about FOSS), but there is a snowball's chance in hell that any of this ever happens in the USA. We'll be lucky if AI doesn't resemble cable TV by 2030.


There's nothing new about being able to copyright something that's a transformation of another work. And they definitely aren't exclusively trained on public data.


> There's nothing new about being able to copyright something that's a transformation of another work

There is something novel here.

Google Books created a huge online index of books, OCRing, compressing, and transforming them. That was copyright infringement.

Just because I download a bunch of copyrighted files and run `tar c | gzip` over them does not mean I have new copyright.
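The `tar c | gzip` point can be made concrete: a compressed blob is unrecognizable as bytes, yet it encodes exactly the same content as the input, with nothing new added (a trivial illustration, not anything from the thread):

```python
import gzip

# Stand-in for some copyrighted input (hypothetical placeholder text).
original = b"Some copyrighted text, verbatim." * 100

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# The compressed bytes look nothing like the input, but they are a
# pure re-encoding of it: decompression reproduces it bit-for-bit.
assert restored == original
print(len(original), len(compressed))
```

The output blob being smaller and visually unrelated to the input is exactly why "it doesn't contain the original" is not, by itself, an argument that something new was created.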

Just because I download an image and convert it from png to jpg at 50% quality, throwing away about half the data, does not mean I have created new copyright.

AI models are giant lossy compression algorithms. They take text, tokenize it, and turn it into weights, and then inference is a weird form of decompression. See https://bellard.org/ts_zip/ for a logical extension to this.

I think this is why the claim that LLMs are unencumbered by copyright is novel. Until now, a human had to perform some creative transformation of a work; it could not simply be a computer algorithm that changed the format or compressed the input.


Google Books is not transformative. It shows you all the same data for the same purpose as they were published for.

A better example is Google Image Search. Thumbnails are transformative because they have a different purpose and aren't the same data. An LLM is much more transformative than a thumbnail.

It's lossier than even ordinary lossy compression because of the regularization term; I'm pretty sure you can train one that's guaranteed not to retain any of the pretraining text. Of course, then it can't answer things like "what's the second line of The Star-Spangled Banner".
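One way to see the regularization point: an L2 penalty (weight decay) makes the training objective explicitly trade goodness-of-fit against small weights, which pushes the model away from rote memorization of the training data. A toy sketch using plain NumPy ridge regression (my own illustration, not anything specific to how LLMs are trained):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=20)

def ridge_fit(X, y, lam):
    # Minimize ||Xw - y||^2 + lam * ||w||^2 (closed-form solution).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_plain = ridge_fit(X, y, lam=0.0)   # fits training data as closely as possible
w_reg = ridge_fit(X, y, lam=10.0)    # regularized: penalized for large weights

# The regularized weights are uniformly smaller: the penalty term
# discourages fitting (memorizing) the training data exactly.
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```

The same trade-off appears, at vastly larger scale, in neural-network training with weight decay: the objective actively fights verbatim retention.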


Google Books is transformative. It's a decided case. And it's the same as Google Image Search, i.e., for search.

https://news.ycombinator.com/item?id=45489807


Well yeah now it is, otherwise it wouldn't exist. I don't think showing the entire book would be though.


Thumbnails are not transformative, they are fair use. They would be copyright infringement, except that a court case ruled them as fair use: https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com... .

The fact that compression is incredibly lossy does not change the fact that it's copyright infringement.

I have a lossy compression algorithm which simply outputs '0' or '1' depending on the parity of the bits of the input.

If I run that against a camcording of a Disney film, the result is a 0 copyrighted by Disney; in fact, posting that 0 in this comment would make this comment also illegal, so I must disclaim that I did not actually produce it from a camcorded Disney film.

If I run it against the book 'Dracula', the result is a 0 in the public domain.

The law does not understand bits, it does not understand compression or lossiness, it understands "humans can creatively transform things, algorithms cannot unless a human imbues creativity into it". It does not matter if your compressed output does not contain the original.
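The parity thought experiment above can be written down as an actual one-bit "lossy compressor" (a hypothetical sketch of the commenter's example):

```python
def parity_compress(data: bytes) -> str:
    """A maximally lossy 'compression algorithm': reduce any input
    to a single character giving the parity of its set bits."""
    ones = sum(bin(byte).count("1") for byte in data)
    return "1" if ones % 2 else "0"

# Completely different inputs can compress to the same output.
# Nothing of either original survives, yet each run is still just
# an algorithmic transformation of its input, with no human
# creativity added anywhere in the pipeline.
print(parity_compress(b"a camcorded film"))
print(parity_compress(b"the text of Dracula"))
```

The point of the example is that "how much of the original survives" and "was creativity involved" are independent questions, and copyright law cares about the latter.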


> The court held that framing and hyperlinking of original images for use in an image search engine constituted a fair use of Perfect 10's images because the use was highly transformative

?


You're missing something: whether or not it's copyright infringement depends on a) how much money you have and hence bribes you can give and b) whether you can say what you're doing is "to beat China".


Who exactly are you imagining is being bribed here?


> Google Books created a huge online index of books, OCRing, compressing them, and transforming them. That was copyright infringement.

No. It's a decided case: it's transformative and fair use. My understanding of why it's transformative is that Google Books mainly offers a search interface for books, and it also has measures to make sure only snippets of books are shown.


Unfortunately this is very unlikely in our foreseeable future, with the U.S. having a "U.S. against the world" mentality in the AI race. I would love to see this, but it would get shot down immediately.


> I wish we had a constitutional amendment that open-sourced all commercial AI models and required documentation of, and links to, all training data and base prompts.

> They are trained on public data at our expense so We The People should own them.

The people whose work appears to have been trained on for the interesting parts of the blog post are mostly, like me, not American.

> AI should be free. Overhyped and Overpriced. I would love this setup for privacy and security.

Also, this entire blog post only exists because they're curious about a specific free open-weights model.

The "source" being, roughly, "the internet", to which we have as much access as most of the model makers (i.e., where you don't, they've got explicit licensing rights anyway), and possibly also some explicitly* pirated content (I've not been keeping track of which model makers have or have not done that).

* as in: not just incidentally


> They are trained on public data

this is questionable, but okay...

> at our expense

?

> so We The People should own them.

in addition to training data, it is my understanding that a model's architecture also largely determines its efficacy. Why should we own the architecture?


Why would it require a constitutional amendment?


The takings clause of the fifth amendment allows seizure of private property for public use so long as it provides just compensation. So the necessary amendment already exists if they're willing to pay for it. Otherwise they'd need an amendment to circumvent the fifth amendment, to the extent the document is honored.


Are models necessarily IP?

If generative AI models' output can't be copyrighted and turned into private IP, who is to say the output of gradient descent and back-propagation can be copyrighted? Neither is the creative output of a human being; both are the products of automated, computed statistical processes.

Similarly, if AI companies want to come at dataset compilation and model training from a fair use angle, would it not be fair use to use the same models for similar purposes if models were obtained through eminent domain? Or through, like in Anthropic's training case, explicit piracy?


It doesn't make sense to me that whether the result of intellectual effort is property or not depends on the legal status of its output, whether its production involved automation, or if it involved statistical computation. These look like vague justifications to take something made by someone else because it has value to you, without compensation.


I'm looking at this through the lens of US copyright, where the Copyright Office determined that AI output isn't protected by copyright, and thus isn't private IP, as it isn't the creative output of a human being.

If the results of inference and generation can't be protected under copyright, as they aren't the creative output of a human being, why wouldn't the results of back-propagation and gradient descent follow the same logic?

This isn't about how we feel about it, it's a legal question.


But things like logarithmic table books existed in a world where the results of the calculations were not protectable as IP, no matter how much effort went into creating them.


I'd settle with them being held in a public trust for public benefit


Wouldn’t the same argument then be applied to all scraped data?



