> I added the koala kicking a football as well, I have to admit that they come out just as bad.
And this is exactly my point.
> The problem is, if certain training data is blanket removed, it creates holes in the understanding a model has. We see this in language models as well: censored models can get very obstinate in their refusal to discuss certain completely normal topics, because it links them to something that has been scrubbed.
This is doubly wrong. First, not having porn in your dataset has absolutely no bearing on your ability to draw a man kicking a football. Second, the thing you're discussing is not a lack of training data but fine-tuning to try and remove it, due to the inevitability of some porn slipping in. SDXL did this too, but it's a perfectly serviceable model.
What we're seeing here is almost certainly just the model being bad because it's much smaller. Drawing unusual poses like kicking a ball is much more difficult than people expect.
This drastic drop in performance is entirely consistent with the similar drops in performance seen in LLMs in their respective space for a comparable change in parameter count.
People are contorting themselves into a conspiracy because they're seeing a lot of badly drawn humans, and that's because HUMANS ARE WHAT PEOPLE OVERWHELMINGLY TRY TO DRAW. The model is just generically bad.
But the model is _really good_ at other subjects, including very complicated ones, like the motorcycle.
By "certain training data" I don't mean porn. It's probably for the best that they did, but the aggressive removal of all copyrighted data, art and images of celebrities must have impacted the output quality compared to the older models that just used everything. This includes many pictures of people playing football, women lying in grass and other similar things. Most finetuning is basically about bringing that back.
If "HUMANS ARE WHAT PEOPLE OVERWHELMINGLY TRY TO DRAW", then you would expect the model to be better at it, no? Or at least have _some_ understanding of poses besides T.
It's not a conspiracy to wonder why the main thing a smaller model fails on is humanoid poses, when things like objects, animals and architecture generally come out just fine. Are you saying an image like this is significantly easier to create than a man kicking a football?
Try the prompt yourself, it will keep producing perfectly fine images with only small mistakes. Further testing seems pointless though, since by now it's confirmed that the current model was a failed product that got rushed out for some reason.
You're vastly underestimating the difficulty of drawing humans. Move your hands just a little bit and you have a completely different geometry. Humans are MUCH harder to draw than a motorcycle. A motorcycle pretty much always looks the same.
It is much, much easier to draw your link vs. a human.
An Aztec castle (edit: not sure if this is a "castle", but Age of Empires 2 taught me that's what it is) always looks like that. Foliage has many variants, but all of them are basically fine. Same for rocks. The seal has minor variation, probably not a lot that the model cares about. The seal is wrong anyway.
"Man kicking a football" has many possible interpretations, and most of the model's errors simply reflect it being bad at committing to just one.
The model could be bad because it simply has less data, but it's much easier to explain it as bad because it's a smaller, less capable model, just like we see with LLMs.