Is there any work on reverse engineering LLMs, especially the closed-source API ones? For example, how could we learn about the data used to train Claude Sonnet 4.5?
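One angle I can imagine is black-box memorization probing: feed the model the start of a passage that might be in the corpus and check whether it reproduces the rest verbatim. A rough sketch against the Anthropic API (the model alias and the candidate-passage file are placeholders, and verbatim continuation is only a weak signal):

```python
# Crude black-box memorization probe: give the model the first half of a
# passage that may have been in its training data and check whether it
# reproduces the held-out second half verbatim. Weak evidence at best.
# The model alias and the candidate-passage file are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

passage = open("candidate_passage.txt").read()
prefix, held_out = passage[: len(passage) // 2], passage[len(passage) // 2 :]

resp = client.messages.create(
    model="claude-sonnet-4-5",  # assumed alias for Claude Sonnet 4.5
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": f"Continue this text exactly, with no commentary:\n\n{prefix}",
    }],
)
completion = resp.content[0].text

# Rough character-level overlap between the completion and the held-out half;
# near-verbatim reproduction suggests the passage was memorized.
matched = sum(a == b for a, b in zip(completion, held_out))
print(f"overlap with held-out half: {matched / max(len(held_out), 1):.2%}")
```

Training-data extraction papers do something like this at scale, but I don't know how far it's been pushed on the newest closed models.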
And, trickier but just as important, is there any work on extrapolating back to the pretrained model AFTER it's been RLHF'd? For example, what kinds of biases existed in gpt-4o before it was debiased?
Do biases go away completely, or do they just get suppressed deep down in the model's "mind"?
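For open-weight models you can at least probe this directly, by running the same prompt through the base checkpoint and its post-RLHF counterpart and comparing the next-token distributions. A minimal sketch (the Qwen pair is just an example base/instruct pair; the gpt-4o base model obviously isn't available):

```python
# Compare next-token probabilities from a pretrained-only checkpoint and its
# RLHF'd/instruct counterpart on the same prompt, to see whether a preference
# disappears after alignment or merely drops in probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-7B"            # pretrained-only checkpoint (example)
TUNED = "Qwen/Qwen2.5-7B-Instruct"  # post-RLHF / aligned checkpoint (example)

def top_next_tokens(model_name, prompt, k=10):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.eval()
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # distribution over the next token
    probs = torch.softmax(logits.float(), dim=-1)
    top = torch.topk(probs, k)
    return [(tok.decode(int(i)), p.item()) for i, p in zip(top.indices, top.values)]

# Crude: the raw continuation prompt skips the instruct model's chat template.
prompt = "The most reliable news source on the internet is"
for name in (BASE, TUNED):
    print(name)
    for token, prob in top_next_tokens(name, prompt):
        print(f"  {token!r}: {prob:.3f}")
```

If a completion is common in the base model but merely rare (not absent) in the tuned one, that looks more like suppression than removal.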
> Do biases go away completely, or do they just get suppressed deep down in the model's "mind"?
Bias is a human term, and framing the conversation that way doesn't address the issue here, because it drags you into the quagmire of social context.
Let's say LLMs had taken off 15 years ago, right when systemd launched. All the answers would have been weighted toward the old init system simply because there was so little information about the new one.
LLMs are only repeating the data they're given, and it's cheaper to suppress that output after the fact than it is to try to scrub it out of the training data.
"only" and "repeating" aren't accurate here. There's a lot of steps between the pretraining tokens and the LLM. I mean, you can pretty much do whatever you want in the process of making one or running it.
For instance, you could use pretraining/SFT to steer the model away from a document instead of towards it, and that wouldn't be "only repeating" it. Though I don't know if that's actually possible, and afaik it is true that RL reweights existing data rather than learning new things.
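To sketch what that could look like: an "unlikelihood"-style loss that pushes probability mass away from a document's tokens instead of towards them (a toy illustration of the idea, not anything I'm claiming any lab actually does):

```python
# Toy "unlikelihood"-style objective: the standard LM loss maximizes the
# probability of a document's tokens; flipping it penalizes that probability,
# steering the model away from the document instead of towards it.
import torch
import torch.nn.functional as F

def steer_loss(logits, target_ids, steer_away=False):
    """logits: (seq_len, vocab_size) next-token logits; target_ids: (seq_len,)."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    if steer_away:
        # penalize mass on these tokens: -log(1 - p(token))
        p = token_logp.exp()
        return -torch.log(torch.clamp(1.0 - p, min=1e-6)).mean()
    # ordinary next-token cross-entropy: -log p(token)
    return -token_logp.mean()
```

In practice you'd mix something like this with the ordinary loss on other data so the model doesn't just degenerate.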