tifa2up's comments | Hacker News

https://agentset.ai/leaderboard/embeddings is a good rundown of other open-source embedding models.


I'm building https://github.com/agentset-ai/agentset, RAG as a service that works quite well out of the box.

We achieve this performance by baking in best practices before any tweaking.


How does it handle retrieval in a multi-turn conversation? Is there an intent graph involved?

Does it summarize past context or keep it all?


Right now it's single-shot. We're looking into building an "Agentic Retrieval" mode based on Claude ADK; TBD how it'll work.


So retrieve once on the first message, and then use that context for the rest of the conversation?


We tried GPT-5 for a RAG use case, and found that it performs worse than 4.1. We reverted and didn't look back.


4.1 is such an amazing model in so many ways. It's still my No. 1 choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x that of GPT-5). Definitely the best non-reasoning model out there for real-world tasks.


Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer-context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them.


For large contexts (up to 100K tokens in some cases). We found that GPT-5: a) has worse instruction following and doesn't follow the system prompt, b) produces very long answers, which resulted in a bad UX, and c) has a 125K context window, so extreme cases resulted in an error.


Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use-case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?


I think it varies by use case. It didn't do well with long contexts.


ChatGPT when using 5 or 5-Thinking doesn’t even follow my “custom instructions” on the web version. It’s a serious downgrade compared to the prior generation of models.


It does “follow” custom instructions, but more as a suggestion than a requirement (compared to other models).


Ah, 100K contexts against a 125K window, that's what poses problems, I believe. GPT-5's scores should go up if you process contexts that are 10 times shorter.


How do you objectively tell whether a model "performs" better than another?


Not the original commenter, but I work in the space and we have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.


> but I work in the space

Ya, the original commenter likely does not work in the space - hence the ask.

> the evaluation of new models is actually very quantitative.

While you may be able to derive a % correct (and hence something quantitative), these benchmarks are by their nature largely subjective: Q&As on written subjects involve judgment calls. Example benchmark: https://llm-stats.com/benchmarks/gpqa. Even though there are techniques to reduce overfitting, it still isn't eliminated, so the results remain subjective.


So… You did look back then didn’t look forward anymore… sorry couldn’t resist.


Don't solve it at the STT level. Get the raw transcription from Gemini, then pass the output to an LLM to fix company names and make other corrections.

Happy to share more details if helpful.


Yeah, I've done it with industry-specific acronyms and this works well. Generate a list of company names and other terms it gets wrong, and give it definitions and any other useful context. For industry jargon, example sentences are good, but that's probably not relevant for company names.

Feed it that list and the transcript along with a simple prompt along the lines of "Attached is a transcript of a conversation created from an audio file. The model doing the transcription has trouble with company names/industry terms/acronyms/whatever else and will have made errors with those. I have also attached a list of company names/etc. that may have been spoken in the transcribed audio. Please review the transcription, and output a corrected version, along with a list of all corrections that you made. The list of corrections should include the original version of the word that you fixed, what you updated it to, and where it is in the document." If it's getting things wrong, you can also ask it to give an explanation of why it made each change that it did and use that to iterate on your prompt and the context you're giving it with your list of words.
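A minimal sketch of that correction pass, assuming the google-genai SDK; the model name, glossary entries, and prompt wording are placeholders, not what the commenters above actually use:

    # pip install google-genai
    from google import genai

    client = genai.Client()  # expects GEMINI_API_KEY in the environment

    # Hypothetical glossary of terms the STT model tends to get wrong.
    GLOSSARY = """
    Acme Robotics: competitor name, often transcribed as "acne robotics"
    SOC 2: compliance standard, often transcribed as "sock two"
    """

    def correct_transcript(raw_transcript: str) -> str:
        prompt = (
            "Attached is a transcript of a conversation created from an audio file. "
            "The transcription model has trouble with company names, industry terms, "
            "and acronyms. Below is a list of terms that may have been spoken. "
            "Review the transcript and output a corrected version, followed by a "
            "list of every correction you made (original text, replacement, location).\n\n"
            f"Terms:\n{GLOSSARY}\n\nTranscript:\n{raw_transcript}"
        )
        response = client.models.generate_content(
            model="gemini-2.0-flash",  # placeholder; any capable model works
            contents=prompt,
        )
        return response.text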


Which specific model do you use?


I've had some luck with this in other contexts. Get the initial transcript from STT (e.g. Whisper), then feed that into Gemini with a prompt giving it as much extra context as possible. For example: "This is a transcript from a YouTube video. It's a conversation between x people, where they talk about y and z. Please clean up the transcript, paying particular attention to company names and acronyms."


I've done the same, it works very well.


Yes, we got 187 self-serve users (all on the free plan), and we're in talks with an enterprise now.


You typically add a lot of metadata to each chunk's text so you can filter on it and include it in the citations. Injecting metadata means you look at which metadata adds helpful context for the LLM, and when you pass the results to the LLM you pass them in a format like this:

Title: ...
Author: ...
Text: ...

for each chunk, instead of just passing the raw text.
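A rough sketch of that formatting step; the field names are illustrative, and real pipelines carry whatever metadata the ingestion step stored:

    def format_chunks_for_llm(chunks: list[dict]) -> str:
        """Render retrieved chunks with their metadata so the LLM can use and cite them."""
        blocks = []
        for i, chunk in enumerate(chunks, start=1):
            blocks.append(
                f"[{i}]\n"
                f"Title: {chunk.get('title', 'Unknown')}\n"
                f"Author: {chunk.get('author', 'Unknown')}\n"
                f"Text: {chunk['text']}"
            )
        return "\n\n".join(blocks)

    # Usage (hypothetical prompt wiring):
    # context = format_chunks_for_llm(results)
    # prompt = f"Answer using only the sources below.\n\n{context}\n\nQuestion: {question}"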


Quite a decent hit. Local models don't perform very well with long contexts. We're planning to support a local-only, offline setup that people can host without additional dependencies.


OP. The way you improve it is to move away from single-shot semantic/keyword search and have an agentic system that can evaluate results and issue follow-up queries.
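A rough sketch of that loop, with the search and LLM calls left as hypothetical helpers; this is the shape of the idea, not Agentset's actual implementation:

    def agentic_retrieve(question: str, max_rounds: int = 3) -> list[str]:
        """Retrieve, let an LLM judge coverage, and issue follow-up queries if needed."""
        collected, query = [], question
        for _ in range(max_rounds):
            collected += search(query, top_k=10)   # hypothetical vector/keyword search
            verdict = llm_json(                    # hypothetical LLM call returning parsed JSON
                "Given the question and the chunks below, reply with "
                '{"sufficient": true} or {"sufficient": false, "follow_up": "<new query>"}.'
                f"\n\nQuestion: {question}\n\nChunks:\n{collected}"
            )
            if verdict["sufficient"]:
                break
            query = verdict["follow_up"]
        return collected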


OP. We migrated to GPT-5 when it came out but found that it performs worse than 4.1 when you pass lots of context (up to 100K tokens in some cases). We found that it:

a) has worse instruction following and doesn't follow the system prompt,
b) produces very long answers, which resulted in a bad UX, and
c) has a 125K context window, so extreme cases resulted in an error.

Again, these were only observed in RAG when you pass lots of chunks; GPT-5 is probably a better model for other tasks.


love the share, ty


OP. Reranking uses a specialized LLM that takes the user query and a list of candidate results, then reorders them based on which ones are most relevant to the query.

Here's sample code: https://docs.cohere.com/reference/rerank
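A minimal example along the lines of those docs, using the Cohere Python SDK (the model name may have changed; check the current reference):

    # pip install cohere
    import cohere

    co = cohere.Client("YOUR_API_KEY")

    query = "What is the capital of the United States?"
    documents = [
        "Carson City is the capital city of the American state of Nevada.",
        "Washington, D.C. is the capital of the United States.",
        "Capital punishment has existed in the United States since colonial times.",
    ]

    response = co.rerank(
        model="rerank-english-v3.0",  # assumed model name
        query=query,
        documents=documents,
        top_n=2,
    )

    for result in response.results:
        print(result.index, result.relevance_score, documents[result.index])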


What is the difference between reranking versus generating text embeddings and comparing with cosine similarity?


My understanding:

If you generate embeddings (of the query, and of the candidate documents) and compare them for similarity, you're essentially asking whether the documents "look like the question."

If you get an LLM to evaluate how well each candidate document follows from the query, you're asking whether the documents "look like an answer to the question."

An ideal candidate chunk/document from a cosine-similarity perspective would be one that perfectly restates what the user said, whether or not that document actually helps the user. Which can be made to work, if you're e.g. indexing a knowledge base where every KB document is SEO-optimized to embed all pertinent questions a user might ask that "should lead" to that KB document. But for such documents, even matching the user's query text against a "dumb" tf-idf index will surface them. LLMs aren't gaining you any ground here. (As is evident from the fact that webpages SEO-optimized in this way could already be easily surfaced by old-school search engines if you typed such a query into them.)

An ideal candidate chunk/document from a re-ranking LLM's perspective would be one that an instruction-following LLM (with the whole corpus in its context) would spit out as a response if it were prompted with the user's query. E.g. if the user asks a question that could be answered with data, a document containing that data would rank highly. And that's exactly the kind of document we'd like "semantic search" to surface.
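To make the cosine-similarity half concrete, here's a tiny sketch with sentence-transformers; the model and example texts are placeholders:

    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary small embedding model

    query = "Why is my invoice higher this month?"
    docs = [
        "Why is my invoice higher this month? Reasons invoices go up.",        # restates the question
        "Usage above the included tier is billed at $0.12/unit at month end.", # actually answers it
    ]

    q_emb = model.encode(query, convert_to_tensor=True)
    d_emb = model.encode(docs, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_emb)[0]

    # Cosine similarity tends to favour the doc that restates the question,
    # even though the second one is the more useful answer.
    for doc, score in zip(docs, scores):
        print(f"{float(score):.3f}  {doc}")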


I've been thinking about the problem of what to do if the answer to a question is very different from the question itself in embedding space. The KB method sounds interesting and isn't something I'd thought about; you sort of work on the "document side", I guess. I've also heard of HyDE, which works on the query side: instead of the user query you generate hypothetical answers and look for documents that are similar to those answers, if I've understood it correctly.
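A rough HyDE-style sketch, assuming the OpenAI SDK and some existing vector-store search helper; both are placeholders for whatever stack you use:

    # pip install openai
    from openai import OpenAI

    client = OpenAI()

    def hyde_search(question: str, top_k: int = 5):
        # 1. Ask an LLM for a hypothetical answer. It may be wrong; that's fine,
        #    we only need something that *looks like* an answer.
        completion = client.chat.completions.create(
            model="gpt-4.1-mini",  # placeholder model
            messages=[{"role": "user",
                       "content": f"Write a short, plausible answer to: {question}"}],
        )
        hypothetical_answer = completion.choices[0].message.content

        # 2. Embed the hypothetical answer instead of the raw question.
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=hypothetical_answer,
        ).data[0].embedding

        # 3. Search the vector store with that embedding (hypothetical helper).
        return vector_store.search(embedding, top_k=top_k)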


The main point didn't get hit on by the responses. Re-ranking is just a mini-LLM (for latency/cost reasons) that does a double check: the embedding model finds the closest M documents in R^N space, and the re-ranker picks the top K documents from those M. In theory, if we just used Gemini 2.5 Pro or GPT-5 as the re-ranker, the performance would be even better than whatever small re-ranker people choose to use.


Text similarity finds items that closely match. Reranking may select items that are less semantically "similar" but more relevant to the query.


The reranker is a cross-encoder that sees the docs and the query at the same time. What you normally do is generate embeddings ahead of time, independent of the prompt, calculate cosine similarity with the prompt, select the top-k chunks that best match the prompt, and only then use a reranker to sort them.

Embeddings are a lossy compression, so if you feed the chunks together with the prompt, the results are better. But you can't do this for your whole DB; that's why you filter with cosine similarity at the beginning.
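A compact sketch of that two-stage setup with sentence-transformers; the model names and corpus are arbitrary examples:

    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                  # embeds chunks ahead of time
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # sees query + doc together

    corpus = ["chunk one ...", "chunk two ...", "chunk three ..."]  # placeholder chunks
    corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)  # precomputed, prompt-independent

    def retrieve(query: str, top_k: int = 50, final_k: int = 5):
        # Stage 1: cheap cosine-similarity filter over the whole DB.
        q_emb = bi_encoder.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
        candidates = [corpus[h["corpus_id"]] for h in hits]

        # Stage 2: the expensive cross-encoder reranks only the candidates.
        scores = cross_encoder.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return ranked[:final_k]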


Because LLMs are a lot smarter than embeddings and basic math. Think of the vector / lexical search as the first approximation.

