You are looking at it wrong. Meta is a business. You know what they sell? Ads.
In fact, they are the #1 or #2 place in the world to sell an ad, depending on who you ask. If the future turns out to be LLM-driven, all that ad money is going to go to OpenAI or, worse, to Google, leaving Zuck with no revenue.
So why are they after AI? Because they are in the business of selling eyeball placement, and LLMs becoming the de facto platform would eat into their margins.
Hey, OP here. You're not wrong! Leaving aside the philosophical debate (isn't every form of capitalist participation selfishly motivated?), the main motivator was to help me and my friends with a problem we struggled with.
Many solo entrepreneurs you see on Twitter with large audiences are busy people, so they hire cheap labor from India or the Philippines to act as their social media manager. Those hires often take on the task of keeping up with the niches and drafting post ideas. The big issue is that the variance in quality of who you hire is very high, and it's also a mental and energy toll to manage an employee who works on the other side of the world.
So the AI scours "here is what all the tech bros have been talking about over the last 3 days," then drafts 3-5 posts and shows them to me so I can curate. I get to keep my page and audience engaged while protecting my time for actual deep work instead of scrolling the feed all day.
Hey, OP here. The use case is to give an agent the ability to post on my behalf. It can use these class labels to figure out "what are my common niches," then come up with keyword search terms to find what's happening in those spaces, and then draft some responses that I can curate, edit, and post.
This is the kind of work you'd typically hire cheap social media managers overseas to do through Fiverr. However, the variance in quality is very high, and the burden of managing people on the other side of the world can be a lot for solo entrepreneurs.
OP here. I agree! I should've called out why I did _not_ follow that approach, since many others have made the same point.
The main reason is that I needed the classification to be ongoing. My system pulled in thousands of tweets per day, and they all needed to be classified as they came in for some downstream tasks.
Thus, I couldn't embed all tweets, then cluster, then ...
Do the labels need to be static once the system has started? If not, it would be interesting to relabel embedding clusters once each hits a certain critical mass of tweets, or to do so somewhat continuously.
OP here. We embed both the label AND the tweet. So if tweet A is "I love burgers" and tweet B is "I love cheeseburgers", we ask our vector DB whether we have seen a tweet very similar to B before. If yes, we skip the LLM altogether (cache hit) and just reuse the class label that A got.
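Roughly, the hit path looks like this (a simplified sketch, not the production code; `embed` is a placeholder for whatever embedding API you use, and the in-memory list stands in for the vector DB):

```python
import numpy as np

MIN_SCORE = 0.90  # cosine similarity threshold for a cache hit (tune this, see below)

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding provider (OpenAI, VoyageAI, ...) and
    return a unit-normalized vector."""
    raise NotImplementedError

# Stand-in for the vector DB: (embedding, label) pairs we have already classified.
index: list[tuple[np.ndarray, str]] = []

def lookup(tweet: str) -> str | None:
    v = embed(tweet)
    for e, label in index:
        # vectors are unit-normalized, so the dot product is the cosine similarity
        if float(v @ e) >= MIN_SCORE:
            return label  # cache hit: reuse the stored label, no LLM call
    return None           # cache miss: caller falls through to the LLM
```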
OP here. I agree with you. For production use we use VoyageAI, which is usually 2x faster than OpenAI at similar quality levels (p90 is < 200ms), but we're looking at spinning up a local embedding model in our cloud environment, which would make p95 < 100ms and make the cost negligible as well.
OP here. Yes, that's right. We also insert the current text's embedding on misses to expand the boundaries of the cluster.
For instance: "I love McDonalds" (1.0), "I love burgers" (0.99), "I love cheeseburgers with ketchup" (?).
This is a contrived example, but the last text could end up right at the boundary of similarity to that first label if we did not store the second one, which could cause a cluster miss we don't want.
We only store the text on cache misses, though you could do both. I had not considered that idea, but it makes sense. I'm not very concerned about the dataset size, because vector storage is generally cheap (~$2/mo for 1M vectors) and the money saved by not generating tokens covers that expense generously.
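On the miss side it's just the fallthrough from the lookup sketch above; something like this (again simplified, with `llm_classify` as a placeholder for the actual prompt call):

```python
def classify(tweet: str, llm_classify) -> str:
    cached = lookup(tweet)
    if cached is not None:
        return cached                        # hit: label copied from the nearest stored tweet
    label = llm_classify(tweet)              # miss: pay for one LLM call
    index.append((embed(tweet), label))      # store the embedding so the cluster boundary grows
    return label
```

In practice you'd reuse the embedding computed during lookup instead of embedding twice, and storing on hits as well (the idea you suggested) would be a one-line change.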
OP here. Yes, that works too and gets you to the same result. It removes the risk of bias, but the trade-off is higher marginal cost and latency.
The idea is also that this would be a classification system used in production, where you classify data as it comes in, so the "rolling labels" problem still exists there.
In my experience though, you can dramatically reduce unwanted bias by tuning your cosine similarity filter.
OP here. This is true. If you set your min_score to 0.99 you can have very high confidence in copy-pasting the label, but then it's not very useful. The big question is how far you can get from 0.99 while still having satisfactory results.
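One way to answer that empirically is to take a held-out set of tweets with known labels, sweep min_score, and look at hit rate vs. accuracy on the hits. A rough sketch of what I mean (the triples here are your own evaluation data, not anything our system emits):

```python
def sweep_min_score(examples, thresholds=(0.80, 0.85, 0.90, 0.95, 0.99)):
    """examples: list of (nearest_similarity, cached_label, true_label) triples,
    one per held-out tweet: its best similarity against the index, the label it
    would copy on a hit, and the label you trust (hand- or LLM-assigned)."""
    for t in thresholds:
        hits = [(cl, tl) for s, cl, tl in examples if s >= t]
        hit_rate = len(hits) / len(examples)
        acc = sum(cl == tl for cl, tl in hits) / len(hits) if hits else float("nan")
        print(f"min_score={t:.2f}  hit_rate={hit_rate:.0%}  hit_accuracy={acc:.0%}")
```

You then pick the lowest threshold whose hit accuracy you can live with, since every extra hit is an LLM call you don't pay for.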