
Is there any work on reverse engineering LLMs, especially the closed source API ones? For example, how can we learn about the data used in Claude Sonnet 4.5 training?

And trickier but just as important, is there any work on recovering the behavior of the pretrained model AFTER it's been RLHF'd? For example, what kinds of biases existed in gpt-4o before it was debiased?

Do biases go away completely, or do they just get suppressed down deep in the model's "mind"?



Yes.

https://arxiv.org/abs/2403.06634

https://arxiv.org/abs/2311.17035

(I just have these ones off the top of my head because I'm a Nicholas Carlini fan and we interviewed him about these attacks.)


Thanks for these, I'll have a look!


> Do biases go away completely or they just get suppressed down deep in the model's "mind"?

Bias is a human term, and couching the conversation in that context does nothing to address the issue here, because it gets into the quagmire of social context.

Let's say LLMs had taken off 15 years ago, at the point systemd launched. All the answers given would be weighted toward the old init system, simply because there was so little information about the new one.

LLMs are only repeating the data they are given, and it's cheaper to remove the data after the fact than it is to try to scrub it out of the training data.
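
To make the systemd point concrete, here is a toy sketch. It is not how an LLM works internally; it only shows how an answer drawn from raw frequencies leans toward whatever dominates the data. The corpus and its 95/5 split are made up for illustration, and `answer_distribution` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical toy corpus: init-system mentions in text written around
# the time systemd launched. The 95/5 split is an invented number.
corpus = ["sysvinit"] * 95 + ["systemd"] * 5


def answer_distribution(mentions):
    """Return the relative frequency of each answer in the corpus."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


dist = answer_distribution(corpus)
# A system that simply follows the data picks the most frequent answer.
most_likely = max(dist, key=dist.get)
print(dist, most_likely)
```

Nothing here is "biased" in the social sense; the skew comes entirely from what was written down at the time.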


"only" and "repeating" aren't accurate here. There are a lot of steps between the pretraining tokens and the LLM. I mean, you can pretty much do whatever you want in the process of making one or running it.

For instance, you could use pretraining/SFT to steer something away from a document instead of towards it, and that wouldn't be "only repeating" it. Though I don't know if that's actually possible, and afaik it is true that RL reweights existing data instead of learning new things.


> Bias is a human term

There are many kinds of bias, plenty of which have nothing to do with culture or social context.



