Doesn't this just mean that the LLM ingested training data where people talk about banning controversial, propaganda-style newspapers, while nobody talks about banning the NYT or WaPo?
I think if people took the time to understand how LLMs assign probabilities to next tokens based on patterns in their training data, they would understand that these results are somewhat deterministic.
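To make the "somewhat deterministic" point concrete: with greedy decoding, the model always emits the highest-probability token, so the same prompt produces the same output. A toy sketch (the token names and logit values here are made up, not from any real model):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores (logits) into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a model might assign to candidate next tokens
vocab = ["ban", "keep", "read"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
# Greedy decoding picks the argmax, so repeated runs on the same
# prompt give the same answer -- the "bias" is baked into the weights.
print(vocab[probs.index(max(probs))])  # → ban
```

Sampling with a nonzero temperature adds randomness, but the skew in the underlying distribution (learned from the training data) is still what drives which answers come out most often.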
Instead, the preferred heuristic is to look for a bogeyman.
But the base model, when it's trained on the whole internet, will have some extreme biases on topics where one side is large and vocal and the other side is mostly silent. So RLHF is an attempt to correct for the biases on the internet.