There is a flaw with the base problem: each tweet only has one label, while a tw...

nico · 2025-10-21T01:03:16 1761008596

I just built a logistic regression classifier for emails and agree

Just using embeddings you can get really good classifiers for very cheap

You can use small embeddings models too, and can engineer different features to be embedded as well

Additionally, with email at least, depending on the categories you need, you only need about 50-100 examples for 95-100% accuracy

And if you build a simple CLI tool to fetch/label emails, it’s pretty easy/fast to get the data

tomrod · 2025-10-21T05:41:25 1761025285

I'm interested to see examples! Is this shareable?

meander_water · 2025-10-21T06:37:28 1761028648

Why wouldn't you use OP's approach to build up the representative embeddings, and then train the MLP on that?

That way you can effectively handle open sets and train a more accurate MLP model.

With your approach I don't think you can get a representative list of N tweets which covers all possible categories. Even if you did, the LLM would be subject to context rot and token limits.

portaouflop · 2025-10-21T01:21:02 1761009662

I am doing a similar thing for technical documentation, basically i want to recommend some docs at the end of each document. I wanted to use the same approach you outlined to generate labels for each document and thus easily find some “further reading” to recommend for each.

How big should my sample size be to be representative ? It’s a fairly large list of docs across several products and deployment options. I wanted to pick a number of docs per product. Maybe I’ll skip the steps 4/5 as I only need to repeat it occasionally once I labelled everything once

minimaxir · 2025-10-21T02:15:40 1761012940

If you're just generating labels from existing documents, you don't need that many data points, but the LLM may hallucinate labels if you have too few relative to the number of labels you want.

For training the model downstream, the main constraint on dataset size is how many distinct labels you want for your use case. The rules of thumb are:

a) ensuring that each label has a few samples

b) atleast N^2 data points total for N labels to avoid issues akin to the curse of dimensionality

addaon · 2025-10-21T14:49:11 1761058151

“Up to X” produces a relatively strong bias for producing X yesses. “For each of these possible labels, write a sentence describing whether it applies or not, then summarize with the word Yes or No” does a bounded amount of thinking per label and removes the bias, at the cost of using more tokens (in your pre-processing phase) and requiring a bit of post-processing.

minimaxir · 2025-10-21T15:20:56 1761060056

Those are just simple prompt examples: obviously more prompt engineering would be necessary.

However, modern LLMs, even the cheaper ones, do handle the up to X constraint correctly without consistently giving X.

mattmanser · 2025-10-21T08:51:29 1761036689

There are multiple labels per tweet in the code examples, so not sure where you got that from.