
Hmm, I think a mixture of beta distributions could work just as well as categorical here. I'm going to try it with PixelRNN, but it's going to take hours or days to train (it's a very inefficient, hard-to-parallelize architecture). I'll report back tomorrow.
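
For concreteness, here's a rough sketch of the kind of output head I have in mind, in PyTorch (names and shapes are illustrative, not my actual code):

    import torch
    import torch.nn.functional as F
    from torch.distributions import Beta, Categorical, MixtureSameFamily

    def beta_mixture(logits, a_raw, b_raw):
        # logits, a_raw, b_raw: (..., K) tensors from the network, one set per
        # color channel. softplus keeps alpha and beta strictly positive.
        mix = Categorical(logits=logits)
        comp = Beta(F.softplus(a_raw) + 1e-4, F.softplus(b_raw) + 1e-4)
        return MixtureSameFamily(mix, comp)

    # Training minimizes -beta_mixture(...).log_prob(x) for pixel values
    # x scaled into (0, 1).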


Update 1: After ~12 hours of training and 45 epochs on CIFAR, I'm starting to see textures.

https://imgur.com/MzKUKhH


Update 2:

After another 24 hours of training and around 100 epochs, we get down to 4.4 bits/dim, and colors are starting to emerge[1]. However, a friend pointed out an issue: the beta log-likelihood weights values near 0 and 1 much more heavily:

    log Beta(x; alpha, beta) = (alpha-1)*log(x) + (beta-1)*log(1-x) - log B(alpha, beta)
                                         ^
                 for alpha < 1, this term --> +oo as x --> 0
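
You can see the pole directly in PyTorch (a toy check, not my training code):

    import torch
    from torch.distributions import Beta

    # For alpha, beta < 1 the density blows up at the endpoints, so a
    # point-evaluated log-likelihood rewards outputs of exactly 0 or 1.
    d = Beta(torch.tensor(0.5), torch.tensor(0.5))
    for x in (0.5, 1e-2, 1e-4, 1e-6):
        print(x, d.log_prob(torch.tensor(x)).item())  # grows without bound as x -> 0
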
This means the loss rewards pushing outputs toward pure colors: black, white, red, blue, green, cyan, magenta, or yellow. Indeed, 3.6% of the channels are exactly 0 or 255, up from 1.4% after 50 epochs[2]. Apparently, an earth-mover loss might be better:

    E_{x ~ output distribution}[|correct - x|]
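
One way to make that differentiable for a beta output (a rough sketch of the idea, not my exact code) is a Monte Carlo estimate, since PyTorch's Beta supports reparameterized sampling:

    import torch
    from torch.distributions import Beta

    def emd_loss(alpha, beta, target, n_samples=64):
        # Differentiable Monte Carlo estimate of E_{x ~ Beta}[|target - x|];
        # rsample() keeps gradients flowing to alpha and beta.
        x = Beta(alpha, beta).rsample((n_samples,))
        return (target - x).abs().mean()
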
I could retrain this for another day or two, but PixelRNN is really slow, and I want to use my GPU for other things. Instead, I trained a 50x-faster PixelCNN for 50 epochs with this new loss and... it just collapsed to the average pixel value (0.5). There's probably a way to train a mixture of betas with this loss, but I haven't figured it out yet.

[1]: https://imgur.com/kGbERDg [2]: https://imgur.com/iJYwHr0


Update 3:

Okay, so my PixelCNN masking was wrong... which is why it collapsed to the mean. With the mask fixed, the earth-mover loss did get better results than negative log-likelihood, but I found a better solution!

The issue with negative log-likelihood was that the network could optimize solely around zero and one, because the density has poles there. The key insight is that the color value in the image is not really a point on [0, 1]; it's a quantized measurement. If we are given #00, all we actually know is that the real-world brightness fell between #00 and #01, so we should be integrating the probability density function from 0 to 1/256 to get the likelihood of that bin. The bin probability is always finite, so the poles stop paying off.
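
In code, the discretized likelihood looks something like this (scipy's betainc is the regularized incomplete beta, i.e. the Beta CDF; this is just to illustrate the idea, since training needs a differentiable version, which is where the next part comes in):

    import numpy as np
    from scipy.special import betainc  # regularized incomplete beta = Beta CDF

    def discretized_beta_nll(alpha, beta, k, bins=256, eps=1e-12):
        # Likelihood of integer channel value k is the probability mass of its
        # bin [k/bins, (k+1)/bins] under the Beta, not the density at a point.
        lo = betainc(alpha, beta, k / bins)
        hi = betainc(alpha, beta, (k + 1) / bins)
        return -np.log(np.maximum(hi - lo, eps))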

It turns out PyTorch does not have a usable implementation of Beta.cdf(), so I had to roll my own. Realistically, I just asked the chatbots what good algorithms exist and had them write the code. I ended up with two approaches:

(1) There's a known continued-fraction form for the CDF (the regularized incomplete beta function), which can be evaluated with Lentz's algorithm (a sketch follows after this list).

(2) Apparently there's a pretty good closed-form approximation as well (Temme [1]).
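
Here's the shape of approach (1) in scalar Python, adapted from the standard Numerical Recipes-style formulation (a training version would need to be vectorized in torch and kept differentiable with respect to alpha and beta, e.g. by running a fixed number of Lentz iterations so autograd can trace through them):

    import math

    def betacf(a, b, x, max_iter=200, eps=3e-7, tiny=1e-30):
        # Continued fraction for the incomplete beta, evaluated with the
        # modified Lentz algorithm.
        qab, qap, qam = a + b, a + 1.0, a - 1.0
        c = 1.0
        d = 1.0 - qab * x / qap
        d = tiny if abs(d) < tiny else d
        d = 1.0 / d
        h = d
        for m in range(1, max_iter + 1):
            m2 = 2 * m
            # Even step of the continued fraction.
            aa = m * (b - m) * x / ((qam + m2) * (a + m2))
            d = 1.0 + aa * d
            d = tiny if abs(d) < tiny else d
            c = 1.0 + aa / c
            c = tiny if abs(c) < tiny else c
            d = 1.0 / d
            h *= d * c
            # Odd step.
            aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
            d = 1.0 + aa * d
            d = tiny if abs(d) < tiny else d
            c = 1.0 + aa / c
            c = tiny if abs(c) < tiny else c
            d = 1.0 / d
            delta = d * c
            h *= delta
            if abs(delta - 1.0) < eps:
                break
        return h

    def beta_cdf(x, a, b):
        # Regularized incomplete beta I_x(a, b), i.e. the Beta(a, b) CDF.
        if x <= 0.0:
            return 0.0
        if x >= 1.0:
            return 1.0
        ln_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                    + a * math.log(x) + b * math.log1p(-x))
        # Use the continued fraction directly where it converges fastest,
        # otherwise use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a).
        if x < (a + 1.0) / (a + b + 2.0):
            return math.exp(ln_front) * betacf(a, b, x) / a
        return 1.0 - math.exp(ln_front) * betacf(b, a, 1.0 - x) / b

    # Sanity check: Beta(2, 2) is symmetric, so beta_cdf(0.5, 2.0, 2.0) == 0.5.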

The first one was a little unstable in training, but worked well enough (output: [2], color hist: [3]). The second was a little more stable in training, but had issues with NaNs near zero and one, so I had to clamp things there, which makes it a little less accurate (output: [4], color hist: [5]).

The bits/dim gets down to ~3.5 for both of these, which isn't terrible, but there's probably something that can be done better to get it below 3.0. I don't have any clean code to upload, but I'll probably do that tomorrow and edit (or reply to) this comment. But that's it for the experiments!

Anyway, I ran this experiment because this sentence was really bothering me:

> But categorical distributions are better for modelling.

And when I investigated why you said that, it turned out the PixelRNN authors had used a mixture of Gaussians, and even said they were probably losing some bits because Gaussians go out of bounds and need to be clipped! So I really wanted to say, "seems like a skill issue, just use beta distributions," but then I had to go check whether that actually worked. My hypothesis was that betas should work even better than a categorical distribution, because the categorical model has to learn that nearby values are indeed nearby, while this is baked into the beta model. You can see the issue show up in the PixelRNN paper, where their outputs are very noisy compared to mine (histogram for a random pixel: [6]).

[1]: https://ir.cwi.nl/pub/2294/2294D.pdf [2]: https://imgur.com/e8xbcfu [3]: https://imgur.com/z0wnqu3 [4]: https://imgur.com/Z2Tcoue [5]: https://imgur.com/p7sW4r9 [6]: https://imgur.com/P4ZV9n4



