
There is a flaw in the base problem: each tweet gets only one label, while a tweet is often about many different things and can't be delineated so cleanly. Here's an alternate approach that both allows for multiple labels and has a lower marginal cost (albeit a higher initial cost) per tweet classified.

1. Curate a large representative subsample of tweets.

2. Feed all of them to an LLM in a single call with a prompt along the lines of "generate N unique labels and their descriptions for the tweets provided". This bounds the problem space.

3. Feed each tweet to an LLM along with the prompt "Here are labels and their corresponding descriptions: classify this tweet with up to X of those labels". This creates a synthetic dataset for training.

4. Encode each tweet as a vector as normal.

5. Then train a bespoke small model (e.g. an MLP) on the tweet embeddings to create a multilabel classifier, where the model predicts, for each label, the probability that it applies.

The small MLP will be super fast and cost effectively nothing beyond what it takes to create the embedding. It also saves the time/cost of performing a vector search or maintaining a live vector database.
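For concreteness, here's a minimal sketch of step 5 in Python/scikit-learn. The random arrays are placeholders standing in for the real embeddings (step 4) and the LLM-assigned labels (step 3); the dimensions and hyperparameters are assumptions, not recommendations.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    n_tweets, embed_dim, n_labels = 5000, 384, 20         # placeholder sizes

    X = rng.normal(size=(n_tweets, embed_dim))            # tweet embeddings (step 4)
    Y = rng.integers(0, 2, size=(n_tweets, n_labels))     # LLM-assigned multilabel targets (step 3)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

    # With a 2D binary target matrix, MLPClassifier trains one sigmoid output per label.
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)
    clf.fit(X_train, Y_train)

    probs = clf.predict_proba(X_test)   # per-label probabilities, shape (n_test, n_labels)
    preds = probs > 0.5                 # threshold to get the final label set per tweet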



I just built a logistic regression classifier for emails and agree

Just using embeddings, you can get really good classifiers very cheaply.

You can use small embedding models too, and you can engineer different features to be embedded as well.

Additionally, with email at least, depending on the categories you need, you only need about 50-100 examples for 95-100% accuracy

And if you build a simple CLI tool to fetch/label emails, it’s pretty easy/fast to get the data
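Rough sketch of that setup (Python/scikit-learn). The category names are hypothetical and the embeddings are random placeholders; in practice X would come from running a small embedding model over the labelled emails.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    categories = ["newsletter", "receipt", "personal"]    # hypothetical label set
    n_emails, embed_dim = 300, 384                        # ~100 labelled examples per category

    X = rng.normal(size=(n_emails, embed_dim))            # email embeddings (placeholder)
    y = rng.integers(0, len(categories), size=n_emails)   # one label per email

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=5).mean())        # quick accuracy estimate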


I'm interested to see examples! Is this shareable?


Why wouldn't you use OP's approach to build up the representative embeddings, and then train the MLP on that?

That way you can effectively handle open sets and train a more accurate MLP model.

With your approach I don't think you can get a representative list of N tweets which covers all possible categories. Even if you did, the LLM would be subject to context rot and token limits.


I am doing a similar thing for technical documentation: basically, I want to recommend some docs at the end of each document. I wanted to use the same approach you outlined to generate labels for each document and thus easily find some “further reading” to recommend for each.

How big should my sample size be to be representative? It’s a fairly large list of docs across several products and deployment options. I wanted to pick a number of docs per product. Maybe I’ll skip steps 4/5, since once everything is labelled I only need to repeat the process occasionally.


If you're just generating labels from existing documents, you don't need that many data points, but the LLM may hallucinate labels if you have too few relative to the number of labels you want.

For training the model downstream, the main constraint on dataset size is how many distinct labels you want for your use case. The rules of thumb are:

a) ensuring that each label has a few samples

b) at least N^2 data points total for N labels (e.g. for 20 labels, at least 400 labelled tweets) to avoid issues akin to the curse of dimensionality


“Up to X” produces a relatively strong bias toward exactly X yeses. “For each of these possible labels, write a sentence describing whether it applies or not, then summarize with the word Yes or No” does a bounded amount of thinking per label and removes the bias, at the cost of using more tokens (in your pre-processing phase) and requiring a bit of post-processing.
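A sketch of that per-label pattern (Python); call_llm is a hypothetical stand-in for whatever chat/completion API you're using, and here it just returns a dummy answer so the snippet runs:

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: swap in a real chat/completion API call.
        return "The label does not seem to apply to this tweet.\nNo"

    def classify(tweet: str, labels: dict[str, str]) -> list[str]:
        assigned = []
        for label, description in labels.items():
            prompt = (
                f"Tweet: {tweet}\n"
                f"Label: {label} - {description}\n"
                "Write one sentence describing whether this label applies to the tweet, "
                "then finish with the single word Yes or No on its own line."
            )
            # Take the final line of the response as the verdict.
            verdict = call_llm(prompt).strip().splitlines()[-1].lower()
            if verdict.startswith("yes"):
                assigned.append(label)
        return assigned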


Those are just simple prompt examples: obviously more prompt engineering would be necessary.

However, modern LLMs, even the cheaper ones, do handle the "up to X" constraint correctly without consistently returning exactly X labels.


There are multiple labels per tweet in the code examples, so I'm not sure where you got that from.



