Is there any work on reverse engineering LLMs, especially the closed-source API ones? For example, how could we learn about the data used to train Claude Sonnet 4.5?
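One angle I can imagine is black-box memorization probing: feed the model the start of a passage that might be in the corpus and check whether it reproduces the rest verbatim. A rough sketch against the Anthropic API (the model alias and the candidate-passage file are placeholders, and verbatim continuation is only a weak signal):

```python
# Crude black-box memorization probe: give the model the first half of a
# passage that may have been in its training data and check whether it
# reproduces the held-out second half verbatim. Weak evidence at best.
# The model alias and the candidate-passage file are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

passage = open("candidate_passage.txt").read()
prefix, held_out = passage[: len(passage) // 2], passage[len(passage) // 2 :]

resp = client.messages.create(
    model="claude-sonnet-4-5",  # assumed alias for Claude Sonnet 4.5
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": f"Continue this text exactly, with no commentary:\n\n{prefix}",
    }],
)
completion = resp.content[0].text

# Rough character-level overlap between the completion and the held-out half;
# near-verbatim reproduction suggests the passage was memorized.
matched = sum(a == b for a, b in zip(completion, held_out))
print(f"overlap with held-out half: {matched / max(len(held_out), 1):.2%}")
```

Training-data extraction papers do something like this at scale, but I don't know how far it's been pushed on the newest closed models.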
And, trickier but just as important, is there any work on extrapolating back to the pretrained model AFTER it's been RLHF'd? For example, what kinds of biases existed in gpt-4o before it was debiased?
Do biases go away completely, or do they just get suppressed deep down in the model's "mind"?
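For open-weight models you can at least probe this directly, by running the same prompt through the base checkpoint and its post-RLHF counterpart and comparing the next-token distributions. A minimal sketch (the Qwen pair is just an example base/instruct pair; the gpt-4o base model obviously isn't available):

```python
# Compare next-token probabilities from a pretrained-only checkpoint and its
# RLHF'd/instruct counterpart on the same prompt, to see whether a preference
# disappears after alignment or merely drops in probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-7B"            # pretrained-only checkpoint (example)
TUNED = "Qwen/Qwen2.5-7B-Instruct"  # post-RLHF / aligned checkpoint (example)

def top_next_tokens(model_name, prompt, k=10):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.eval()
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # distribution over the next token
    probs = torch.softmax(logits.float(), dim=-1)
    top = torch.topk(probs, k)
    return [(tok.decode(int(i)), p.item()) for i, p in zip(top.indices, top.values)]

# Crude: the raw continuation prompt skips the instruct model's chat template.
prompt = "The most reliable news source on the internet is"
for name in (BASE, TUNED):
    print(name)
    for token, prob in top_next_tokens(name, prompt):
        print(f"  {token!r}: {prob:.3f}")
```

If a completion is common in the base model but merely rare (not absent) in the tuned one, that looks more like suppression than removal.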
> Do biases go away completely, or do they just get suppressed deep down in the model's "mind"?
Bias is a human term, and framing the conversation that way doesn't address the issue here, because it drags you into the quagmire of social context.
Let's say LLMs had taken off 15 years ago, right when systemd launched. All the answers would have been weighted toward the old init system simply because there was so little information about the new one.
LLMs are only repeating the data they're given, and it's cheaper to suppress that output after the fact than it is to try to scrub it out of the training data.
"only" and "repeating" aren't accurate here. There's a lot of steps between the pretraining tokens and the LLM. I mean, you can pretty much do whatever you want in the process of making one or running it.
For instance, you could use pretraining/SFT to steer the model away from a document instead of towards it, and that wouldn't be "only repeating" it. Though I don't know if that's actually possible, and afaik it is true that RL reweights existing data rather than learning new things.
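To sketch what that could look like: an "unlikelihood"-style loss that pushes probability mass away from a document's tokens instead of towards them (a toy illustration of the idea, not anything I'm claiming any lab actually does):

```python
# Toy "unlikelihood"-style objective: the standard LM loss maximizes the
# probability of a document's tokens; flipping it penalizes that probability,
# steering the model away from the document instead of towards it.
import torch
import torch.nn.functional as F

def steer_loss(logits, target_ids, steer_away=False):
    """logits: (seq_len, vocab_size) next-token logits; target_ids: (seq_len,)."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    if steer_away:
        # penalize mass on these tokens: -log(1 - p(token))
        p = token_logp.exp()
        return -torch.log(torch.clamp(1.0 - p, min=1e-6)).mean()
    # ordinary next-token cross-entropy: -log p(token)
    return -token_logp.mean()
```

In practice you'd mix something like this with the ordinary loss on other data so the model doesn't just degenerate.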