Exploring 12M of the 2.3B images used to train Stable Diffusion (waxy.org)
446 points by detaro on Aug 30, 2022 | 150 comments


Hi, laion5b author here,

Nice tool!

You can also explore the dataset there https://rom1504.github.io/clip-retrieval/

Thanks to approximate kNN, it's possible to query and explore that 5B dataset with only 2TB of local storage; anyone can download the kNN index and metadata to run it locally too.

Regarding duplicates, indeed it's an interesting topic!

Laion5b deduplicated samples by url+text, but not by image.

To deduplicate by image, you need an efficient way to compute whether images a and b are the same.

One idea is to compute a hash based on CLIP embeddings. A further idea would be to train a network that is actually good at dedup, not only similarity, by training on positive and negative pairs, e.g. with a triplet loss.
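
For illustration, a minimal sketch of flagging near-duplicates from precomputed CLIP image embeddings with faiss might look like the following; the exact-search index and the 0.96 threshold are just placeholders, and at 5B scale you'd swap in an approximate index:

    # Sketch: flag near-duplicate images from precomputed CLIP embeddings.
    import faiss
    import numpy as np

    def find_near_duplicates(embeddings, threshold=0.96):
        """embeddings: (N, d) array of CLIP image embeddings."""
        embeddings = np.array(embeddings, dtype="float32")   # contiguous float32 copy
        faiss.normalize_L2(embeddings)              # cosine similarity == inner product
        index = faiss.IndexFlatIP(embeddings.shape[1])   # exact; use IndexIVFPQ at scale
        index.add(embeddings)
        sims, ids = index.search(embeddings, 2)     # k=2: self plus nearest other image
        pairs = []
        for i in range(len(embeddings)):
            j, sim = int(ids[i, 1]), float(sims[i, 1])
            if sim >= threshold and i < j:          # report each candidate pair once
                pairs.append((i, j, sim))
        return pairs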

Here's my plan on the topic https://docs.google.com/document/d/1AryWpV0dD_r9x82I_quUzBuR...

If anyone is interested in participating, I'd be happy to guide them. This is an open effort; just join the LAION Discord server and let's talk.


You are probably very aware of it, but just to highlight the importance of this for people who aren't aware: data duplication degrades the training and makes memorization (and therefore plagiarism, in the technical sense) more likely. For language models, this includes near-similarities, which I'd guess would extend to images.

Quantifying Memorization Across Neural Language Models https://arxiv.org/abs/2202.07646

Deduplicating Training Data Makes Language Models Better https://arxiv.org/abs/2107.06499 https://twitter.com/arankomatsuzaki/status/14154721921003397... https://twitter.com/katherine1ee/status/1415496898241339400


I have been using the rom1504 clip retrieval tool[0] up until now, but the Datasette browser[1] seems much better for Stable Diffusion users.

When my prompt isn't working, I often want to check whether the concepts I use are even present in the dataset.

For example, inputting `Jony Ive` returns pictures of Jony Ive in Datasette and pictures of apples and dolls in clip retrieval.

(I know laion 5B is not the same as laion aesthetic 6+, but that's a lesser issue.)

[0] - https://rom1504.github.io/clip-retrieval/

[1] - https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/im...


This is due to the aesthetic scoring in the UI. Simply disable it if you want precise results rather than aesthetic ones.

It works for your example.

I guess I'll disable it by default, since it seems to confuse people.


Done https://github.com/rom1504/clip-retrieval/commit/53e3383f58b...

Using CLIP for searching is better than direct text indexing for a variety of reasons, but here in particular because it better matches what Stable Diffusion sees.

Still interesting to have a different view over the dataset!

If you want to scale this out, you could use elastic search


I see, thanks! I didn't realize that; I thought I'd want to keep aesthetic scoring enabled since Stable Diffusion was trained on LAION-Aesthetics.

---

Also: There is a joke to be made at Jony's expense regarding the need to turn off aesthetic scoring to see his face.


Data is an excellent place to look at to get a sense of where the model is likely to work or not (what kinds of images), and for prompt design ideas because, roughly speaking, the probability of something working well is proportional to its frequency (or of things very similar to it) in the data.

The story is more complex though, because the data can often be quite far away from actual neural net training due to preprocessing steps, data augmentations, oversampling settings (it's not uncommon to not sample data uniformly during training), etc. So my favorite thing to build is a "batch explorer": during training one dumps batches into pickles immediately before the forward pass of the neural net, then writes a separate explorer that loads the pickles and visualizes them to "see exactly what the neural net sees" during training. Ideally one then spends some quality time (~hours) looking through batches to get a qualitative sense of what is likely to work or not work and how well. Of course this is also very useful for debugging, as many bugs can be present in the data preprocessing pipeline. But a batch explorer is harder to obtain here because you'd need the full training data/code/settings.
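
A minimal sketch of that kind of batch explorer, assuming a PyTorch-style loop where each batch is an (images, captions) tuple; the file layout here is made up:

    import glob
    import os
    import pickle

    import matplotlib.pyplot as plt
    import torchvision

    # Inside the training loop, right before `model(images)`:
    def dump_batch(batch, step, out_dir="batch_dumps"):
        os.makedirs(out_dir, exist_ok=True)
        with open(f"{out_dir}/batch_{step:07d}.pkl", "wb") as f:
            pickle.dump(batch, f)

    # Separate explorer script: load the pickles and eyeball what the net sees.
    def explore(out_dir="batch_dumps"):
        for path in sorted(glob.glob(f"{out_dir}/*.pkl")):
            with open(path, "rb") as f:
                images, captions = pickle.load(f)          # assumed batch structure
            grid = torchvision.utils.make_grid(images[:16], nrow=4, normalize=True)
            plt.imshow(grid.permute(1, 2, 0).cpu())
            plt.title(os.path.basename(path))
            plt.axis("off")
            plt.show()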


With big generative models, seeing data even once is more than sufficient to memorize it. So your claim that performance relates to frequency is not exactly correct.

The whole point of this model class is that one can learn one word from one sample, another pixel from another one and so on to master the domain. The emergent, non-trivial generalization is what makes them so fascinating. There is no simple, linear/first order relationship with data and behaviour. Case in point: GPT3 can do few-shot learning despite not having used any explicit few-shot formatted data during training.

Not saying you are wrong, but the story is not as simple as in plain supervised learning with small datasets.


What does that say about how these models will behave as an increasingly large portion of their training data is outputs from similar models? Our curation of the outputs will hopefully help. And if one image really is enough, perhaps the smaller number of human created images will be sufficient to inject new stuff rather than stagnating?


If anyone is interested in the technical details, the database itself is a 4GB SQLite file which we are hosting with Datasette running on Fly.

More details in our repo: https://github.com/simonw/laion-aesthetic-datasette

Search is provided by SQLite FTS5.


I notice a surprising number of duplicates. E.g. if I sort by aesthetic, there’s the same 500x500 Tuscan village painting multiple times on the first page of results.

Presumably it wouldn’t be so hard to hash the images and filter out repeats. Is the idea to keep the duplicates to preserve the description mappings?


I noticed this too, with the same description every time. How does this work in the model? Does this give repeated images a bigger weight?

It's surprising that these weren't filtered out, and it would be interesting to know the number of unique images. (When it is mentioned that a model was trained on 10 billion images, for example, obviously if each image is repeated 5 times then the actual number of images is 2 billion, not 10.)


The search speed is amazing!! Do you have to do a lot of pre-indexing to get it so fast?


It's SQLite's built in FTS index, nothing special on top of it. I built the index by running:

    sqlite-utils enable-fts data.db images text
https://sqlite-utils.datasette.io/en/stable/cli.html#configu...
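
For querying, the FTS table can be joined back to the source table; a minimal sketch using Python's built-in sqlite3 module, assuming sqlite-utils' default naming of images_fts for the shadow table:

    import sqlite3

    db = sqlite3.connect("data.db")
    rows = db.execute(
        "select images.* from images "
        "join images_fts on images_fts.rowid = images.rowid "
        "where images_fts match ?",
        ("jony ive",),                      # full-text search term
    ).fetchall()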


Are you running on anything special compute-wise? I have a budget node running Mongo, it takes almost a second to fetch a single 1MB document.

Writing it out I realize it's not indexed by the attribute I'm retrieving by...


I started this on a Fly instance with 256MB of RAM and a shared CPU. This worked great when it was just a couple of people testing it.

Once it started getting traffic it started running a bit slow, so I bumped it up to a 2 CPU instance with 4GB of RAM and it's been fine since then.

The database file is nearly 4GB and almost all memory is being used, so I guess it all got loaded into RAM by SQLite.

I'll scale it back down again in a few days time, once interest in it wanes a bit.


But MongoDB is webscale!


Use the index, Luke!


Are there plans to expand the model to be even larger?


It is interesting that there are at least a few images in the dataset that were generated by previous diffusion methods:

https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/im...

(Surely there are many more that don't have this specific label).


Turtles all the way down


"The most frequent artist in the dataset? The Painter of Light himself, Thomas Kinkade, with 9,268 images."

Oh that's why it is so good at generating Thomas Kinkade style paintings! I ran a bunch of those and they looked pretty good. Some kind of garden cottage prompt with Thomas Kinkade style works very well. Good image consistency with a high success rate, few weird artifacts.


I've noticed an affinity towards "The Greats" of modern painting. I've gotten incredible results from using Dali, Picasso, Bacon, Lichtenstein, etc. I haven't had as much luck with slightly-less-known artists of similar styles (eg Braque, Guayasamín, or Gris, as opposed to Picasso).


Can you come up with a prompt that will reproduce some of his painting almost exactly?


Not sure.


I’m curious about this. Can it plagiarize? Can GPT3? Why or why not?


I always had the crazy idea of "infinite entertainment": somehow we manage to "tap" into the multiverse and are able to watch TV from countless planets/universes (I think Rick and Morty did something similar). So, in some channel at some time you may be able to see Brad Pitt fighting against Godzilla while the monster is hacking into the pentagon using ssh. Highly improbable, but in the multiverse TV everything is possible.

Now I think we don't need the multiverse for that. Give this AI technology a few years and you'll have streaming services a la Netflix where you provide the prompt to create your own movie. What the hell, people will vote "best movie" among the millions submitted by other people. We'll be movie producers like we are nowadays YouTubers. Overabundance of high quality material and so little time to watch them all. Same goes for books, music and everything else that is digital (even software?).


I think much more likely, at least this side of the singularity, what we'll have is infinite dreck. And for some people, that will be enough.

It's extremely hard to make good content. Teams of extremely skilled, well-paid people, even ones who have succeeded before, fail regularly. And that's with complicated filtering mechanisms and review cycles to limit who has access and keep the worst of it from getting out.

But not everybody needs everything to be actually good. My partner will sometimes unwind on a Friday by watching bad action movies; the bad ones are in some ways better, as they require less work to understand and are amusing in their own way. Or there's a mobile game I play when I want to not think, where you have to conquer a graph of nodes. The levels are clearly auto-generated, and it's fine.

I think that kind of serviceable junk is where we might see AI get to in a couple of decades, made for an audience that will get a weed gummy and a six pack and ask for a "sci-fi action adventure with lots of explosions" and get something with a half-assed plot, forgettable stereotypical characters and visuals that don't totally make sense, but that's fine. You won't learn anything, you won't be particularly moved, and you won't ever watch it again, but it will be a perfectly cromulent distraction between clocking out and going to bed.


Whenever I think of AI and its implications, I find it useful to think of our current version of the AI: The Market. Its profit-maximizing function is the canonical paperclip maximizer.

We are going to absolutely drown in crap. Just as we do now, it's just going to flood the internet at an unimaginable pace. We'll probably train AIs to help us find stuff and tell what's real/true/etc; it's going to be a heck of an arms race.

It's going to be one hell of an arms race.


> So, in some channel at some time you may be able to see Brad Pitt fighting against Godzilla while the monster is hacking into the pentagon using ssh.

Don't know if this is sarcasm, if it is, ignore the rest of the comment.

Honestly, it sounds terrible.

Good shows are well written, coherent and, most of all, narrow in scope.

If an AI can write the next Better Call Saul, great.

Randomly patching together tropes sounds more like kids' drawings, which are interesting from an artistic point of view, maybe, given their limited knowledge of reality and narrative, but terribly boring and confusing as a form of entertainment.

Unless the audience is kids; they love that stuff, for reasons we don't understand anymore as we grow up.


Did you read the whole post? Parent post was talking about assisted movie generation where a human is making a movie and using the AI as a tool to make the content. This will absolutely be an enormous thing in the next 10, 20 years and it will lead to a creative revolution in the same way that youtube did - entire genres that do not currently exist will come to fruition by lowering the barriers to entry to a huge number of creators.

I don't have any trouble finding youtube channels that I like to watch and ignoring the rest, and I suspect I won't have any trouble finding movies generated using AI as a production tool that I want to watch either.


> I don't have any trouble finding youtube channels that I like to watch and ignoring the rest, and I suspect I won't have any trouble finding movies generated using AI as a production tool that I want to watch either.

Not really the same - there is a range from good to bad on YouTube, because real people are adding the creative spark. There is no reason to suspect AI will generate such a range, and it's unclear we will ever get to the point where AI can do "creativity" by itself.


Exactly. I'm sure we'll see a lot of AI-assisted production. But AI-originated and high quality? I don't think I'll see it in my lifetime. (I do expect though, that we'll see people claiming works as AI-created, as the controversy will be stellar marketing.)


> Did you read the whole post

As per the HN guidelines, don't ask this question.

> Parent post was talking about assisted movie generation where a human is making a movie and using the AI as a tool to make the content

which is exactly the problem we don't have.

there are thousands of scripts written everyday that never see the green light.

> and it will lead to a creative revolution

it won't

main reason why content is not produced is money.

unless you find a way to create an infinite supply of money and an infinite amount of paying audience for that content, more content is a problem, not a solution.

> I don't have any trouble finding youtube channels that I like to watch and ignoring the rest,

so what's the problem?

There's already infinite content out there; what does "AI" bring to the table that will make any difference, other than marketing, like 3D movies?

Have you watched any 3D movie recently?


It sounds like you are arguing FOR the GP's idea.

"there are thousands of scripts written everyday that never see the green light." "main reason why content is not produced is money."

So if there are plenty of ideas and not enough money, and you could put those ideas into a box and spit out a movie that would normally cost millions, that's good right?


> So if there are plenty of ideas and not enough money,

big chunk of the budget is spent on marketing.

if you produce something that nobody watches, it's like the sound of a tree falling where nobody can hear it.

if you know how to use AI to cut that cost, I'm all ears.

Also: Al Pacino will want his money if you use his name, even if he is not actually acting in the movie.

Reality is that there's plenty of ideas, true, that would not make any money though.

Studios don't like to work at loss.

Rick and Morty costs 1.5 million dollars per episode and from what we've heard from director Erica Hayes, a single episode takes anywhere between 9 and 12 months to create from ideation to completion


>if you know how to use AI to cut that cost, I'm all ears.

Cutting costs seems to be the main reason AI is being explored. If you go to a studio asking for budget to create a movie and predict "10 000 people will watch it", they will laugh in your face. If one person with the help of AI can make the movie and 10 000 people will watch it, it's a win for everyone involved.

I don't see YouTube channels having enormous budgets for marketing, yet they find a sizeable audience and still make a profit. Once you lower the cost of production, you don't need huge marketing budgets to secure profits.


> I don't see YouTube channels having enormous budgets for marketing,

Because they mainly support one person.

You don't need a big budget to sell lemonade on the street, you can make a salary out of it, doesn't mean you have become a tycoon or have revolutionized the lemonade stand industry.

> I don't see YouTube channels having enormous budgets for marketing

Have you seen those ads every 15 seconds?

That's the marketing budget; the whole YouTube ad revenue is the marketing budget.


> if you know how to use AI to cut that cost, I'm all ears.

A social media account with a couple million followers i.e. https://www.instagram.com/lilmiquela/?hl=en

> Also: Al Pacino will want his money if you use his name, even if he is not actually acting in the movie.

The creator doesn't need Al Pacino. I'm not following this one. Rick and Morty doesn't need Angelina Jolie

> Reality is that there's plenty of ideas, true, that would not make any money though.

Plenty of websites, 99.999% of them don't make any money. I definitely still think the easier it is to make a website the better.

> Rick and Morty costs 1.5 million dollars per episode and from what we've heard from director Erica Hayes, a single episode takes anywhere between 9 and 12 months to create from ideation to completion

If that includes marketing and advertising, that's amazingly cheap. Cut the creation time down to a few weeks, throw in a few product placements, post it on your social media/youtube/etc and you have your own movie studio.


There is a lot more that goes in to it than an idea. Like that dude who has a great app "idea" and just needs someone to implement it, but is surprised nobody takes them up on it.


Way back in 2019, someone used deepfake tech to show what the new Lion King could look like. This is what I think some of us are imagining.

http://geekdommovies.com/heres-what-the-live-action-lion-kin...

I have a decently old LG 3D TV that can actually turn 2D into 3D, and it's actually a lot of fun to watch certain stuff in 3D mode.


The point of the quoted remark is to illustrate the exotic possibilities of “Multiverse TV”. It’s not an example of quality content you /would/ watch, it’s merely some content you /could/ watch. Multiverse TV has everything, from Better Call Saul to zany strings of tropes.


The point is that nonsense "multiverse TV" has been a thing since I can remember watching TV.

Endless entertainment is already there; you can't watch it simply because most of the content you're talking about is not on air, and streaming platforms don't buy it because it's shit.

Not that I don't like shit - I've watched more Troma/SyFy/random low-budget Asian movies (I am a huge fan of ninja movies) than necessary - but if we stop to think that there are already 6 or 7 Sharknado movies (which are exactly the kind of endless entertainment you talk about; they are probably generated in some way), maybe it's not the volume of content that's missing, but content that's worth watching.


> not well written, not coherent, and broad in scope.

> randomly patched together tropes

Sounds like a dream, not as in what I wish for, but what I experience at night.

So, if you think about it like a tool to enable a form of lucid dreaming, it may be something interesting.

Of course you have to find a way to get from your brain what you want to see in "real time", but I think we will get there.


> So, if you think about it like a tool to enable a form of lucid dreaming, it may be something interesting.

we usually call that tool psychedelic drugs.

There are devices being developed for that purpose; I don't think they will ever be reliable, and AI is not necessary for that.

On the philosophical implications of lucid dreams

https://en.m.wikipedia.org/wiki/Waking_Life


I think a key piece here is missing.

txt2img is quite limited; img2img is really where the power is, with a little intermittent guidance from a human hand.

What took hundreds or thousands of people to write, act, record, and post-process (a Better Call Saul, say) might be doable by a team 1/100th the size, possibly even a single individual. Which means, while it might not instantly spit one out, just like YouTube there will be an incredible amount of great content to watch, far more than anyone could ever realistically watch.

And of course there will be lots of utter trash as well.

But if it took 100 people to make Better Call Saul, now 100 individuals can make 100 different "Better Call Sauces".


Sturgeon’s law will probably end up at 99%


I think you are forgetting that a good portion of social media users are used to short term content.


I believe there's already a lot more content on social media than time to watch it in 100 lives.


But if you want to see something specific, it's often not there unless you generate it.


> But if you want to see something specific, it's often not there unless you generate it.

Example?

I don't think I've ever wanted to watch something that only a computer could generate.


> Overabundance of high quality material and so little time to watch them all.

If a tree falls down in a forest and there is no one there to hear it, does it make a sound?

If you can generate infinite material, how do you judge quality?

You're extrapolating an idea based on what movies are, fundamentally. But you don't take into consideration what movies are not. Watching a movie is also a social experience. Going to the movie theater, waiting years for a big blockbuster title, watching something with friends. Word-of-mouth recommendation is a very big thing. If a close friend recommends me something (be it a movie or a book), I'm much more inclined to like it just for the human connection it provides (reading or watching something other people enjoyed is a means of accessing someone else's psyche).

If every time you watch a movie you have the knowledge that there is a movie that is slightly better a prompt away, why bother finishing this one? If you know you probably won't finish the movie you generated, why bother starting one? So what do you do? You end up rewatching The Office.

Sure, if you tell me this will be possible in a couple of years, I won't object. The point is: will you pay for it on a recurring basis? Because if you don't, this will be no more than a very cool tech project.

----

I've recently had this idea for a sci-fi book: in a future not so distant, society is divided between tech and non-tech people. Tech people created pretty much everything they said they would create. AGI, smarter-than-human robots, you name it. But it didn't change society at all. Companies still employ regular humans, people still watch regular made-by-human movies and eat handmade pizzas and drink their human-made lattes in hipster coffee shops. So tech people are naturally very frustrated at non-tech people, because they're not optimizing their lives and businesses enough. And then you have this awkward situation where you have all these robots with brains the size of a galaxy lying around, doing nothing. And then some of them start developing depression, from spending too much time idle. And then the tech people have to rush to develop psychiatric robots. And then some robots decide to unionize, and others start writing books about how humans are taking jobs that were supposed to be automated.


I think a more interesting thought is how entertainment boils down to a sequence of 1's and 0's... and if we could somehow get enough of that sequence right, we could unveil video/audio/images of real people doing things they've never done - such as Brad Pitt fighting Godzilla.

Imagine uncovering a movie that was never made but featured actors you know. If Steven Spielberg can make the movie, then there is an undiscovered sequence of 1's and 0's that already is that movie, a sequence that could be discovered without actually making the movie. Imagine "mining for movies"...

Of course that sequence is likely impossible to ever predict enough of to actually discover something real... but it's a fun thought experiment.


If you like thinking about that sort of thing, and haven't read it yet, check out the short story "The Library of Babel" by Jorge Luis Borges


Isn't everything representable in a digital form? I think we're in the very early era of entertainment becoming commoditized to an even higher degree than it is now.

I envision exactly the future as you describe: Feed a song to the AI, it spits out a completely new, whole discography from the artist complete with lyrics and album art that you can listen to infinitely.

"Hey Siri, play me a series about chickens from outer space invading Earth": No problem, here's a 12 hour marathon, complete with a coherent storyline, plot twists, good acting and voice lines.

The only thing that is currently limiting us is computing power, and given enough time, the barrier will be overcome.

A human brain is just a series of inputs, a function that transforms them, and a series of outputs.


"Highly improbable, but in the multiverse TV everything is possible."

Quick reminder, there are infinitely many even numbers and none of them are odd.

A given infinite (or transfinite) set does not necessarily contain all imaginable elements.


Ok... I don't think anyone would expect an infinite supply of water to also include vodka and wine. He specifically said "infinite TV".


Does any universe have a "just a completely dark room" channel? How many dark room channels are there?

Is there a universe with channels focusing on these subjects:

"Video of the last tears of terminal cancer patients as Jerry Lewis tells jokes about his dick"

"This guy doesn't like ice cream but he eats it to reassure his girlfriend that he isn't vegan"

"A single hair grows on a scalp"

"Infant children read two-century-old stories about shopping for goose-grease and comment on the prosody"

"Gameshows where an entire country's left-handed population guesses how you'll die"

This is the whole point of the Rick and Morty cable bit. There are things that would not be on TV in any universe that invents TV. It's hilarious to pretend they would be.


It's clear you have a misunderstanding of infinity.

In a show, there are a certain number of frames. In one frame, there are a certain number of pixels. Each pixel can be one of some number of colors. An infinite TV would be able to show every combination of every color of pixels, followed by every combination of frames, simultaneously and forever. All those shows are in there. Not only that, but all of this is also countably infinite.


That's not a multiverse cable situation and there are way more kinds of infinity than just countable and uncountable.

In a multiverse situation you're watching actual content from an infinity of universes where beings exist who have produced and selected that content.

You're not watching an infinite amount of static and magically selecting the parts of static that are coincidentally equivalent to specific content.

All multiverse content must be watchable and creatable by the kind of creature that creates a television. So content that is unwatchable or uncreatable by any conceivable creature will not be in that infinite set.

It is very easy to describe impossible content and any impossible content will not be on the multiverse TV.

Trivial counterexamples that are describable but uncreatable, in any universe similar enough to ours that it has television:

-a channel that reruns actual filmed footage of a given universe's big-bang will not exist.

-a channel that shows *only* accurate, continuous footage of the future of that universe.

-a channel that shows the result of dividing by zero.

Channels that may or may not be uncreatable:

-a channel that induces in the viewer the sensation of smelling their mother's hair.

-a channel that causes any viewer to eat their own feet.

-a channel that cures human retinal cancer.

-a channel that shows you what you are doing, right now, in your own home in this universe, like a mirror. Note that this requires some connection between our universe and the other universe and there's no guarantee, in a multiverse situation, that connections between the universes are also a complete graph.

These examples are more important. We know they are either possible or impossible but we do not know which. Just saying "multiverses are infinite" doesn't answer the question.

For further reading, review https://en.wikipedia.org/wiki/Absolute_Infinite and remember that a channel is a defined set.


Don't forget to posit an infinite curator who watches every channel on the infinite TV and responds to queries for channels with something sensible, not a random garble of pixels.


The Library of Babel, now only $19.95 per month!


Sounds like the "interdimensional cable" bits on Rick and Morty. Reportedly the co-creators would get black-out drunk and improvise the shows. My favorite is House Hunters International, where ambulatory houses are being hunted by guys with shotguns.

https://screenrant.com/rick-morty-interdimensional-cable-epi...


It does feel like we could at least get procedurally generated streaming music. Music is limited enough that it feels possible, and people are more than willing to rate and save music they like. The social element of raiding another person's curated playlist could take over the romantic notion of an artist's personal expression. Curating such lists of procedurally generated music could make everyone a musician.


See this recent thread for more on this: https://news.ycombinator.com/item?id=32559119


There are a handful of TV series canceled too soon that I'd love to see new episodes of. There are tremendous issues of consent involved, but it's a nice dream to have. It's a slippery slope to be sure, but imagine taking your well-honed fanfic script, loading it into a system and getting a ready-to-watch episode never before seen.


Anyone have an idea of how many orders of magnitude advancement we are away from this? Like, it takes a high end GPU a non-trivial amount of time to make one low resolution image. A modern film is at least 30 of those a second, at far higher pixel density. It seems like you get a 10x improvement in GPU perf every 6-8 years[1], and it might take more than a couple of those.

Plus you need to encode the ideas of plots/characters/scenes, and have through-lines that go multiple hours. It seems like with the current kit it’s hard to even make a consistent looking children’s book with hand picked illustrations.

My gut is we are more than a few years off, but maybe I’m underestimating the low hanging fruit?

1: https://epochai.org/blog/trends-in-gpu-price-performance


> So, in some channel at some time you may be able to see Brad Pitt fighting against Godzilla while the monster is hacking into the pentagon using ssh.

that's the movie GAN, more interesting would be the zero-shot translations of epic foreign/historic films into your native culture


Reminds me just a bit of the Culture series, where in the distant future computing power is essentially infinite. In the series, many of the great AIs of unfathomable intellect spend their free time in the "Infinite Fun Space," which is simulating universes with slightly different starting conditions and physical laws.


I’m curious how much this prevents new styles from emerging though, rather than rehashing of things that already exist.


The model was trained on 2.3B images, but how many TV shows are there to train it on?

There are quite a few books written, so maybe transfer learning from that?


> how many TV shows are there to train it on?

None, according to the MPAA.


ML training already involves scraping copyrighted content. I'm sure big tech megacorps would fight any lawsuits they receive.


> Strangely, enormously popular internet personalities like David Dobrik, Addison Rae, Charli D’Amelio, Dixie D’Amelio, and MrBeast don’t appear in the captions from the dataset at all

Self-awareness here would have led to the removal of "enormously popular".


This has been an incredibly helpful tool today exploring Stable Diffusion.

I'm starting to realize Stable Diffusion doesn't understand many words, but it's hard to tell which words are causing it problems when engineering a prompt. Searching this dataset for a term is a great way to tell whether Stable Diffusion is likely to "understand" what I mean when I say that term; if there are few results, or if the results aren't really representative of what I mean, Stable Diffusion is likely to produce garbage outputs for those terms.


Huh, there's a ton of duplicates in the data set... I would have expected that it would be worthwhile to remove those. Maybe multiple descriptions of the same thing helps, but some of the duplicates have duplicated descriptions as well. Maybe deduplication happens after this step?

http://laion-aesthetic.datasette.io/laion-aesthetic-6pls/ima...


Per the project page: https://laion.ai/blog/laion-400-open-dataset/

> There is a certain degree of duplication because we used URL+text as deduplication criteria. The same image with the same caption may sit at different URLs, causing duplicates. The same image with other captions is not, however, considered duplicated.

I am surprised that image-to-image dupes aren't removed, though, as the cosine similarity trick the page mentions would work for that too.


I assume having multiple captions for the same image is very helpful actually.


Scrolling through the sorted link from the GP, there are a few dupes with identical images and captions, so that doesn't always work either.


Isn't it really expensive to dedupe images based on content? As you have to compare every image to every other image in the dataset?

How could one go about deduping images? Maybe using something similar to the rsync protocol? A cheap hash method, then a more expensive one, then a full comparison, maybe. Even so, 2B+ images... and you are mostly talking about saving on storage costs, which are quite cheap these days.


It depends on exactly what problem you're trying to solve. If the goal is to find the same image with slight differences caused by re-encoding, downsampling, scaling, etc. you can use something like phash.org pretty efficiently to build a database of image hashes, review the most similar ones, and use it to decide whether you've already “seen” new images.

That approach works well when the images are basically the same. It doesn't work so well when you're trying to find images which are either different photos of the same subject or where one of them is a crop of a larger image or has been modified more heavily. A number of years back I used OpenCV for that task[1] to identify the source of a given thumbnail image in a larger master file and used phash to validate that a new higher resolution thumbnail was highly similar to the original low-res thumbnail after trying to match the original crop & rotation. I imagine there are far more sophisticated tools for that now but at the time phash felt basically free in comparison to the amount of computation which OpenCV required.

1. https://blogs.loc.gov/thesignal/2014/08/upgrading-image-thum...
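
For the "basically the same image" case, a sketch with the Python imagehash library (a different implementation than phash.org, same idea); the distance threshold is a guess, and the linear scan would need a proper index at this scale:

    import imagehash
    from PIL import Image

    seen = []   # (hash, path) pairs for images already accepted

    def near_duplicate_of(path, max_distance=4):
        """Return the path of a previously seen near-duplicate, or None."""
        h = imagehash.phash(Image.open(path))
        for other_hash, other_path in seen:
            if h - other_hash <= max_distance:   # Hamming distance between 64-bit hashes
                return other_path
        seen.append((h, path))
        return None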


No, you embed all images with CLIP and use an approximate nearest neighbour library (like faiss) to get the most similar ones to the query in logarithmic time. Embedding will also be invariant to small variations.

You can try this on images.yandex.com - they do similarity search with embeddings. Upload any photo and you'll get millions of similar photos, unlike Google that has only exact duplicate search. It's diverse like Pinterest but without the logins.

Query image: https://cdn.discordapp.com/attachments/1005626182869467157/1...

Yandex similarity search results: https://yandex.com/images/search?rpt=imageview&url=https%3A%...


You don't have to compare all images to one another, and doing so wouldn't reliably dedupe - what if one image is a slightly different resolution, a different image type, has different metadata, etc.? They would have different hashes but still be basically the same data.

I think the way you do it is to train a model to represent images as vectors. Then you put those vectors into a BTree which will allow you to efficiently query for the "nearest neighbor" to an image in log(n) time. You calibrate to find a distance that picks up duplicates without getting too many non-duplicates, and then it's n log(n) time rather than n^2.

If that's still too slow there is also a thing called ANNOY which lets you do approximate nearest neighbor faster.
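
A sketch of that approximate-nearest-neighbor step with the annoy library, assuming `vectors` is a list of precomputed image embeddings; the tree count and distance threshold are illustrative and would need the calibration described above:

    from annoy import AnnoyIndex

    dim = 512                             # embedding dimensionality (assumed)
    index = AnnoyIndex(dim, "angular")    # angular distance tracks cosine similarity

    for i, vec in enumerate(vectors):     # `vectors`: one embedding per image
        index.add_item(i, vec)
    index.build(20)                       # 20 trees: more trees, better recall

    for i in range(len(vectors)):
        ids, dists = index.get_nns_by_item(i, 5, include_distances=True)
        for j, d in zip(ids, dists):
            if j != i and d < 0.1:        # small angular distance => duplicate candidate
                print(f"possible duplicate: {i} and {j} (distance {d:.3f})")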


There are hash algorithms for image similarity. For a toy example, imagine scaling the image to 8x8px, making it grayscale, and using those 64 bytes as hash. That way you only have to hash each picture once, and can find duplicates by searching for hashes with a low hamming distance (number of bit flips) to each other, which is very fast.

Of course actual hash algorithms are a bit cleverer; there are a number to choose from depending on what you want to consider a duplicate (cropping, flips, rotations, etc.).
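
A literal version of that toy scheme, with the usual tweak of thresholding each pixel against the mean so the Hamming distance works on bits, using only Pillow:

    from PIL import Image

    def average_hash(path, size=8):
        """Shrink to 8x8 grayscale, set one bit per pixel brighter than the mean."""
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (1 if p > mean else 0)
        return bits                        # 64-bit integer

    def hamming(a, b):
        return bin(a ^ b).count("1")       # number of differing bits

    # Two images are probably duplicates if hamming(h1, h2) is small, e.g. <= 5.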


But even a single pixel being one shade brighter would make the hash completely different; that's the point of hashes.


You're probably confusing "cryptographic hash functions" with "perceptual hashing" (or other forms of "locality-sensitive hashing"). In the case of the latter, what you say is almost always not true (that's the point of using "perceptual hashing" after all: similar objects get mapped to similar/same hash).

See: https://en.wikipedia.org/wiki/Perceptual_hashing


I don't have experience with image duplication, but if you can make a decent hash a 2.3 billion item hashtable is really cheap.

If you need to do something closer to pairwise (for instance, because you can't make a cheap hash of images which papers over differences in compression), make the hash table for the text descriptions, then compare the images within buckets. Of the 5 or 6 text fields I just spot-checked (not even close to a random selection), the worst false positive I found (in the 12M data set) was 3 pairs of two duplicates with the same description. On the other hand, I found one set of 76 identical images with the same description.
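
A sketch of that bucketing idea, assuming `records` is an iterable of (image_path, description) pairs; within a bucket this just compares exact file hashes, which is cheap and crude:

    import hashlib
    from collections import defaultdict

    buckets = defaultdict(list)            # description -> image paths
    for image_path, description in records:
        buckets[description].append(image_path)

    for description, paths in buckets.items():
        if len(paths) < 2:
            continue                       # nothing to compare in this bucket
        seen = {}
        for path in paths:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                print(f"duplicate: {path} == {seen[digest]} ({description!r})")
            else:
                seen[digest] = path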


Convert the images to embeddings, and perform an approximate nearest neighbor search on them and identify images that are very close together (e.g. with faiss, which the page alludes to using).

It's performant enough even at scale.


At a minimum a hash should be computed for each image and dupes removed. I haven't read the paper so they might have already done so.


The next internet crawl is going to have thousands of (properly labeled) AI generated images. I wonder if that could throw these algorithms into a bad feedback spiral. Though I guess there are already classifiers that can be used to exclude AI generated images. The goal is to profit off of free human labor after all.


> The goal is to profit off of free human labor after all.

this is too inflammatory in my opinion.

- some works are in public domain

- tech companies have profited from creators for a long time. I'm sure some arrangement could be made for profit sharing for artists who care about money, but it's too early for that (no profits, I'm sure most companies are losing money on AI art)

- some artists care about art or fame more than money. Their art will not be devalued by AI; if anything, constant usage of their names in prompts is going to make them massively popular and direct people to source material or merch, which they may buy.

- some artists are dead and don't care anymore. Their "estate" is vulnerable to takeovers by "long lost but recently found" relatives, who don't care about art itself, only about money. Many such stories.

One example, albeit not in paintings but in music is Jimi Hendrix Estate. They used to do copyright strikes on YouTube in order to remove fan-made compilations of rare material (cleaned up sound of live concerts, multiple sources mixed into one etc.), without intentions to ever release an alternative.


They need a feature that for a given generation, shows the nearest image in the training set. It is clearly doing more than “memorizing”, but for some “normal” queries, how do you know the output isn’t very similar to some training image? That could have legal implications.


I don't think anyone has a definition of "nearest" that could accomplish that in general. Comparing pixel data is easy, but comparing the subjects portrayed and how is much harder to pin down.


You could try reverse search in an image search engine. Google or Bing support that, for example.


How would one go about adding more data to the dataset?

Would one need to retrain on the entire dataset? Or is there typically a way to just add an incremental batch?


This is quite interesting. This really makes me wonder how much of the differences between Stable Diffusion, Dall-e 2 and MidJourney are due to different architectures and training intensity and how much is due to different datasets.

For example Stable Diffusion knows much better than MidJourney what a cat looks like, MidJourney knows what a Hacker Cat looks like, while Stable Diffusion doesn't (you can tell it to make a cat in a hoodie in front of a laptop, but it won't come up with that on its own). Meanwhile for landscapes Stable Diffusion seems to have no problem with imagination. How much of that is simply due to blindspots in the training data?


Why not just add a shortcut to search for the related images in these artist count tables, etc.?

Also, a UI issue: the sorting arrow feels wrong:

https://i.imgur.com/uyUXAXy.png

The norm is that when the arrow is pointing down, the data is currently sorted descending (I'm aware you can interpret it as "what will happen if you click it", but the norm is to show what is currently in use.)


That's the exact problem: should the button show what will happen, or what is currently happening?

When I designed that feature I looked at a bunch of systems and found examples of arrows in both directions.


At least Excel and Google Sheets both display it the way I described. Adding Apple Numbers, which I unfortunately never use, I think that should cover like 90% of actual use cases for it to be considered the "convention".


That inclusion of Mickey in the model is waving a red flag in front of an impressive bull.


I was just thinking yesterday whether Mockey The Rat would fly as an homage/derivative; Mickey has a good 70 years already, no? Copyright will die at the hands of ML, I'm afraid.


As expected, very few NSFW images were included in the training set, according to this. They are more afraid of showing a penis than showing Mickey Mouse.


> but often impossible with DALL-E 2, as you can see in this Mickey Mouse example from my previous post

> “realistic 3d rendering of mickey mouse working on a vintage computer doing his taxes” on DALL·E 2 (left) vs. Stable Diffusion (right)

Well, but the Mickey Mouse on the right isn't "realistic", or even 3D. It's straight up just a 2D Mickey image pasted there.


My fundamental and maybe dumb question is: when is artificial intelligence / ML going to get smarter than needing a billion images to train?

Sure, the achievements of ML models lately are impressive, but it's so slow in learning. We are brute forcing the DNNs it feels to me, which is not something that smacks of great achievement.

You and I have never seen even 100,000 photos in our lives. Well, maybe the video stream from our eyes is a little different. But it's not a billion fundamentally different images.

Is there anything I can read about why it is so slow to learn? How will it ever get faster? What next jump will fix this, or what am I missing as a lay person?


> You and I have never seen even 100,000 photos in our lives

If someone's 30, that'd only require seeing 10 images a day. For most people that quota is probably fulfilled within a couple of minutes of watching TV or browsing social media, even if the video stream from our eyes otherwise counts as nothing.

We've also had about 4 billion years of evolution, slowly adjusting our genome with an unfathomable amount of data. Gradient descent is blazing fast by comparison.


Add to that, a video nowadays is at least 24 fps. So a two-hour movie suffices haha :)


A couple of things:

* We have seen more than 100,000 "photos" in the sense that photos are just images - if photos are just images, we have a constant feed of "photos" every single moment our eyes are open. Of course, that's not the same as these training datasets, but it is still worth keeping in mind.

* All of these things trained on massive datasets with self-supervised learning are in a sense addressing the "slowness" of learning you mention, since self-supervised (aka no annotations are needed beyond the data itself) "pre-training" on the massive datasets can then enable training for downstream tasks with way less data.

* Arguably requiring massive datasets for pre-training is still a bit lame, but then again the 4-5 years of life it takes to reach pretty advanced intelligence in humans represents a whoooole lot of data. And as with self-supervised learning on these massive models, a lot of intelligence seems to come down to learning to predict the future from sensory input.

* Humans also come with a lot of pre-wiring done by evolution, whereas these models are trained from scratch. Evolutionary wiring represents its own sort of "pre-training", of course.

So basically, it is not so slow to learn as it seems. Arguably it could get faster once we train multimodal models and concepts from text can reinforce learning to understand images and so on, and people are working on it (eg GATO). There may also need to be a separation between low level 'instinct' intelligence and high-level 'reasoning' intelligence; AI still sucks at the second one.


Yeah. Humans take a long time to train. We spend years and years, starting at birth, just absorbing everything around us before we get to a point where we're considered adults.


Yet how do we do it with the CPUs in our heads, which consume even less power than ARM chips?


Because believe it or not matrices are not brains.

People need to get over the metaphors. If you spend your time learning about the mathematics under the hood, there will be fewer "mysteries" then.


Human brains are also pre-trained at birth, on faces and a whole bunch of other things.


If we consider a frame rate of 60 FPS then a 5 year old would have seen about ~ 6.3 billion images [60 (frames) * 60 (seconds) * 60 (minutes) * 16 (waking hours) * 365 (days) * 5 (years)]. Even with 30 FPS you can halve the number and it's still a huge number.

A cool fact is that this model fits ~5B images in a 900M-parameter model, which is tiny compared to the size of the data.


I guess what I find really interesting is, how come we can start to self-label data we encounter in the wild, yet the DNN needs data to constantly be labeled at the same intensity per image, into the billions?


If you look at the text labels for the data used by Stable Diffusion you'll find that they are very low quality. Take a look at some here:

https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/im...

Clearly quality of labeling isn't nearly as important once you are training on billions of images.


This is only true for multi modal learning, but yeah, in that case we need text and image pairs. More than likely it's possible to pretrain image and language separately, and then use a vastly smaller number of pairs. But that's hypothetical.


Textual Inversion is about to flip that around (if I'm understanding the paper correctly).

https://textual-inversion.github.io/


Most of us have our eyes open, looking at things ~16 hours a day, with the first 18 years of our lives heavily focused on learning about what those things are, plus we have the extra brain capacity to remember those things, and to think about them in an abstract manner. My entire photo library alone is over 100,000 photos - and since I took all of them I will have "seen" them.


> You and I have never seen even 100,000 photos in our lives. Well, maybe the video stream from our eyes is a little different. But it's not a billion fundamentally different images.

I would argue precisely the opposite (as you allude to): it's more than hundreds of billions of fundamentally (what does this even mean?) different images. Calculate the frequency at which your eyes sample, think of the times the angle changes (new images), multiply by your age, multiply by 2 for two eyes looking in slightly different directions, factor in the noise your brain has in forming the image "in your head" because you drank too much... you can continue adding factors (hours of "TV" watched on average) ad nauseam.

It seems that "slow to learn" has a "real" target/bound: what humans are capable of. If it takes Bob Ross decades to paint all the "images" in his head, then maybe we should go easy on the algorithms?


I recommend looking into "transfer learning".

That's where you start with an existing large model, and train a new model on top of it by feeding in new images.

What's fascinating about transfer learning is that you don't need to give it a lot of new images, at all. Just a few hundred extras can create a model that's frighteningly accurate for tasks like image labeling.
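
A minimal transfer-learning sketch in PyTorch; the ResNet-50 backbone and five-class head are arbitrary stand-ins, and only the new head gets trained:

    import torch
    from torch import nn
    from torchvision import models

    model = models.resnet50(pretrained=True)        # start from a large pretrained model
    for param in model.parameters():
        param.requires_grad = False                 # freeze the pretrained features

    num_classes = 5                                 # your handful of new labels
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new, trainable head

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Standard loop over a few hundred labelled images (dataloader assumed):
    # for images, labels in dataloader:
    #     optimizer.zero_grad()
    #     loss = loss_fn(model(images), labels)
    #     loss.backward()
    #     optimizer.step()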

This is pretty much how all AI models work today. Take a look at the Stable Diffusion model card: https://github.com/CompVis/stable-diffusion/blob/main/Stable...

They ran multiple training sessions with progressively smaller (and higher quality) images to get the final result.


Some experiments have demonstrated that being involved in the 'generation' of the visual data (i.e. choosing where to move, where to look, how to alter reality) gets significantly better learning than passively receiving the exact same visual data.

Active learning is a good way to improve sample efficiency - however, as others note, don't underestimate the quantity of learning data that a human baby needs for certain skills even with good, evolution-optimized priors.


I think the thing that’s missing is that the AI can’t train itself. If you were asked to draw a realistic x ray of a horse’s ribcage, you’d probably google image search, do some research about horse anatomy, etc, before putting pen to paper. This thing is being trained exactly once, and can’t learn dynamically. That’ll be the next step I think.


What are you are describing is pretty much reinforcement learning (or learning with access to a query-able knowledge engine, or active learning, or all of these combined). There is work on a bunch of variations of this, but it's true that it's early days for combining it with generative systems.


Yeah, I think the query-able knowledge engine is the key here, although I think it’s maybe more like “we haven’t figured out how to generalize conceptual learning”. The computer not only has to be able to query images on the internet, but also know how/what to query, which includes a bunch of actions computers are currently incapable of. In some cases, we might complete a drawing task by traveling to a new place, or taking an action (throw an egg at concrete) not querying the internet.


> Unsurprisingly, a large number came from stock image sites...Adobe Stock’s...iStockPhoto...Shutterstock

Are they ok with their stock photos being used to train a service that's likely to bite into their stock photo business?


The general understanding is that no one cares if they're ok with that, as training a model (even for a competing service) is not among the exclusive rights of copyright owners where their permission is required - if you have a legally obtained copy, you can use it as you want (except for the specific explicitly enumerated use cases that copyright law awards exclusively to the copyright holder), no matter whether the copyright holder likes it or not.


Thinking over this "magical" tech. A distinctive painting style is one of the biggest goals for an artist. Producing a unique stroke and clear expression takes years of hard work and experimentation.

An artist decides to sell prints of an expensive artwork, so he or she publishes a photo on their website. AI scrapers get the images in a dataset update. Game over for the artist.

I hope for a class action over these training data sets. I get that kids have fun with the new photoshop filters. I get that software is eating "the world", but someone must wake up and push the kill switch. It is possible.


Sounds like a hopeless protectionist endeavor, reminiscent of cartoonish Keynesian economics busywork: "We can't permit development of adding machines, what about all the hard work people have put into memorizing multiplication tables?!".


No. Sounds like common sense. Not popular in the tech community nowadays. Data is the new "petrol". Since when is petrol free of charge? People en masse are clueless. If you use my "human" accomplishments as an energy source, you must pay me. Period.


> Since when is petrol free of charge?

At around the same time that it actually resembles "the new data", or maybe even shares a single quality with it? Being a physical object, it is bound by physical properties... like scarcity. Data, being an abstract concept, suffers no such constraint. Same story for whatever artistic technique you've imagined to be not only valuable, but novel to all of human experience and wholly owned by you and you alone. Your valuation of your worth and that of your labor is laughably overinflated and the market is telling you so. Period.


> Your valuation of your worth and that of your labor is laughably overinflated and the market is telling you so. Period.

Good luck with that. :)

Everything around you is a product of Intellectual Property. Your logic is laughably naive.

The market will shift to AI generated nonsense moved by greed and over-optimization. People will reject this en masse.

And there is one big undeniable reason: Human Psychology.


In which I run `order by punsafe desc` and immediately regret it.


Although it did show me that my masseuse isn't as skilled as I thought


It's hard for users to produce the expected results without clear guidance on correct grammar, contextual information, ...


NSFW is entertaining. It tends to think “knobs” in the prompt mean “female breasts” which is annoying.


> Nearly half of the images, about 47%, were sourced from only 100 domains, with the largest number of images coming from Pinterest

This makes me vaguely uneasy. All these models and tools are almost exclusively "western".


> https://github.com/gradio-app/gradio

Plenty of Asian artists/styles in the datasets, no?


It feels like a tsunami is coming and we have no idea how big it will be.


Like the self driving revolution? Or the bitcoin/blockchain revolution?

Personally, I’m not even getting out my popcorn yet.


Bitcoin/blockchain doesn't have any intrinsic value other than to those who believe in it. Self-driving cars (Level 4-5) are not available to the public and are still in development. This stuff is real, produces some incredible results, is available to the public, and is advancing at a rapid rate.


The output of these models seems really impressive, but for my money the notion that it has value is undermined by the way its trainers keep "proprietary" data that is likely to be in violation of image usage rights at a large scale. What is the true value of something that can only be had at the other end of misbegotten extraction/exploitation? It seems like a similar trade-off to the one that web3 proponents are asking us to make. The apparent end-game is that we'll kill off all the true value creators - the working artists responsible for the source data - and all we'll be left with is an artifact of their works.


No, real artists will just become artisans, like any producers of hand-made goods.


Real people are creating real things with this tech right now. Beyond that, people are enthusiastically building on this technology to create higher level tools. This will only be able to go so far with the stable diffusion model, but the ceiling is still very high with what we already have, and given the pace of model progress we can realistically expect the next 10 years or so to be absolutely transformative for art, and probably after that writing and music.


Fair position given the failure of crypto to live up to the revolutionary hype.

This is clearly different. The value has been demonstrated - and it has clear implications for a lot of jobs.


Very different. This stuff is genuinely useful already, and is getting more effective every day.


Self driving cars are on the road. Any two people on Earth are able to trustlessly transmit value between them using Bitcoin. You seem cynical.


Google still can't reliably figure out that an email from jpm0r4ncha$e telling me how much money I've won is spam. Once they nail that down, then maybe I'll step inside one of their self driving cars. Until then, I'll laugh at the video where the Tesla flattens a child-sized mannequin.


Any two wallet addresses are able to. That doesn’t mean the people are. By abstracting the actual process of getting and using the Bitcoin on both ends you’ve lost all actual real world detail.

…and they can still lose it all to typos or fees permanently.


Aaaand the estimated percentage of images released under a CC license, or public domain, iiiiis...?


So excellent. Flipping the story we see all the time on its head. AI's quasi-mystical powers are endless spectacle. Taking a look through the other side of the looking glass is vastly overdue. Amazing work.

Just just just starting to scratch the surface. 2% of the data gathered, sources identified. These are a couple of sites we can now source as the primary powerers of AI. Barely reviewed or dived into in terms of the content itself. We have so little sense & appreciation for what lurks beneath, but this is a go.


It is an unambiguous social necessity to demystify these things.

In a world where the lay public didn’t really know about photoshop, photoshop would be a terrifying weapon.

Likewise modern ML is for the most part mysterious and/or menacing because it’s opaque and arcane and mostly controlled by big corporate R&D labs.

Get some charismatic science popularizers out there teaching people how it works, and all of a sudden it's not such a big scary thing.



