4.1 is such an amazing model in so many ways. It's still my No. 1 choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x GPT-5's). Definitely the best non-reasoning model out there for real-world tasks.
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer-context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them.
For large contexts (up to 100K tokens in some cases). We found that GPT-5:
a) has worse instruction following; doesn't follow the system prompt
b) produces very long answers, which resulted in a bad UX
c) has a 125K context window, so extreme cases resulted in an error
Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use-case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?
ChatGPT when using 5 or 5-Thinking doesn’t even follow my “custom instructions” on the web version. It’s a serious downgrade compared to the prior generation of models.
Not the original commenter, but I work in the space and we have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.
Ya, the original commenter likely does not work in the space - hence the ask.
> the evaluation of new models is actually very quantitative.
While you may be able to derive a % correct (and hence something quantitative), these evaluations are by their nature very much subjective: grading Q&A on written subjects is a judgment call. Example benchmark: https://llm-stats.com/benchmarks/gpqa. Even though there are techniques to reduce overfitting, it still isn't eliminated, so the results remain subjective.
Yeah, I've done it with industry-specific acronyms and this works well. Generate a list of company names and other terms it gets wrong, and give it definitions and any other useful context. For industry jargon, example sentences are good, but that's probably not relevant for company names.
Feed it that list and the transcript along with a simple prompt along the lines of "Attached is a transcript of a conversation created from an audio file. The model doing the transcription has trouble with company names/industry terms/acronyms/whatever else and will have made errors with those. I have also attached a list of company names/etc. that may have been spoken in the transcribed audio. Please review the transcription, and output a corrected version, along with a list of all corrections that you made. The list of corrections should include the original version of the word that you fixed, what you updated it to, and where it is in the document." If it's getting things wrong, you can also ask it to give an explanation of why it made each change that it did and use that to iterate on your prompt and the context you're giving it with your list of words.
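A minimal sketch of that correction pass with the OpenAI Python SDK (the model name, the term list, and the file name are all placeholders):

    # Sketch of the correction pass described above; model, terms and paths are
    # placeholders, not a recommendation.
    from openai import OpenAI

    client = OpenAI()

    terms = "\n".join([
        "Acme Corp: client company, often misheard as 'Acne Corp'",
        "SSO: single sign-on; frequently transcribed as 'S.S.O.' or 'sso'",
        "Kubernetes: container orchestration platform",
    ])

    with open("transcript.txt") as f:
        transcript = f.read()

    prompt = (
        "Attached is a transcript of a conversation created from an audio file. "
        "The model doing the transcription has trouble with company names, industry "
        "terms and acronyms, and will have made errors with those. I have also "
        "attached a list of terms that may have been spoken in the audio. Please "
        "review the transcription and output a corrected version, along with a list "
        "of all corrections you made: the original word, what you changed it to, "
        "and where it is in the document.\n\n"
        f"TERMS:\n{terms}\n\nTRANSCRIPT:\n{transcript}"
    )

    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)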
I've had some luck with this in other contexts. Get the initial transcript from STT (e.g. Whisper), then feed that into Gemini with a prompt giving it as much extra context as possible. For example: "This is a transcript from a YouTube video. It's a conversation between x people, where they talk about y and z. Please clean up the transcript, paying particular attention to company names and acronyms."
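Rough sketch of that pipeline, assuming the openai-whisper and google-generativeai packages (model choices are just examples, not a recommendation):

    # Step 1: local Whisper for the raw transcript; step 2: Gemini for cleanup.
    import whisper
    import google.generativeai as genai

    stt = whisper.load_model("medium")
    raw = stt.transcribe("episode.mp3")["text"]

    genai.configure(api_key="...")  # your Gemini API key
    model = genai.GenerativeModel("gemini-1.5-pro")

    prompt = (
        "This is a transcript from a YouTube video. It's a conversation between two "
        "people, where they talk about cloud infrastructure costs. Please clean up "
        "the transcript, paying particular attention to company names and acronyms.\n\n"
        + raw
    )
    print(model.generate_content(prompt).text)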
You typically add a lot of metadata to each chunk's text, both to be able to filter on it and to include it in citations. Injecting metadata means figuring out which metadata adds helpful context for the LLM; when you pass the results to the LLM, you pass them in a format like this:
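Something along these lines (the exact fields are illustrative and vary by corpus; this is just to show the shape):

    # Illustrative only: build a context block where each chunk carries its
    # metadata, so the LLM can use it for context and cite sources.
    results = [
        {"text": "Annual plans can be refunded within 30 days.",
         "metadata": {"title": "Refund policy", "source": "kb/refunds.md",
                      "section": "Annual plans", "updated": "2024-06-01"}},
    ]

    def format_chunk(i, chunk):
        meta = chunk["metadata"]
        return (
            f"[{i}] title: {meta['title']} | source: {meta['source']} | "
            f"section: {meta['section']} | updated: {meta['updated']}\n"
            f"{chunk['text']}"
        )

    context = "\n\n".join(format_chunk(i, c) for i, c in enumerate(results, start=1))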
Quite a decent hit. Local models don't perform very well on long contexts. We're planning to support a local-only offline setup for people to host without additional dependencies.
OP. The way you improve it is to move away from single-shot semantic/keyword search and toward an agentic system that can evaluate results and run follow-up queries.
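Very roughly, the loop looks something like this (the search and llm calls below are placeholders, not our actual stack):

    # Hypothetical agentic retrieval loop: run a search, let the model judge whether
    # the results answer the question, and if not, let it issue a follow-up query.
    def agentic_answer(question, search, llm, max_rounds=3):
        gathered = []
        query = question
        for _ in range(max_rounds):
            gathered += search(query, top_k=10)
            context = "\n".join(r["text"] for r in gathered)
            verdict = llm(
                f"Question: {question}\n\nResults so far:\n{context}\n\n"
                "If these results are enough to answer the question, reply DONE. "
                "Otherwise reply with a single follow-up search query."
            )
            if verdict.strip().upper() == "DONE":
                break
            query = verdict.strip()
        return llm(
            f"Answer the question using only these results.\n\n"
            f"Question: {question}\n\nResults:\n"
            + "\n".join(r["text"] for r in gathered)
        )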
OP. We migrated to GPT-5 when it came out but found that it performs worse than 4.1 when you pass lots of context (up to 100K tokens in some cases). We found that it:
a) has worse instruction following; doesn't follow the system prompt
b) produces very long answers, which resulted in a bad UX
c) has a 125K context window, so extreme cases resulted in an error
Again, these were only observed in RAG when you pass lots of chunks; GPT-5 is probably a better model for other tasks.
OP. Reranking is a specialized LLM that takes the user query and a list of candidate results, then reorders them based on which ones are most relevant to the query.
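A rough sketch of that idea (model choice and output parsing are illustrative, not our production setup):

    # Hypothetical LLM-as-reranker: ask a small model to order the candidates by
    # relevance and return their indices.
    from openai import OpenAI

    client = OpenAI()

    def rerank(query, candidates, top_k=5):
        listing = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Rank these results from most to least relevant to the query. "
                    "Reply with the bracketed indices only, comma-separated.\n\n"
                    f"Query: {query}\n\nResults:\n{listing}"
                ),
            }],
        )
        order = [int(s.strip(" []")) for s in resp.choices[0].message.content.split(",")]
        return [candidates[i] for i in order[:top_k]]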
If you generate embeddings (of the query, and of the candidate documents) and compare them for similarity, you're essentially asking whether the documents "look like the question."
If you get an LLM to evaluate how well each candidate document follows from the query, you're asking whether the documents "look like an answer to the question."
An ideal candidate chunk/document from a cosine-similarity perspective would be one that perfectly restates what the user said — whether or not that document actually helps the user. Which can be made to work, if you're e.g. indexing a knowledge base where every KB document is SEO-optimized to embed all pertinent questions a user might ask that "should lead" to that KB document. But for such documents, even matching the user's query text against a "dumb" tf-idf index will surface them. LLMs aren't gaining you any ground here. (As is evident from the fact that webpages SEO-optimized in this way could already be easily surfaced by old-school search engines if you typed such a query into them.)
An ideal candidate chunk/document from a re-ranking LLM's perspective would be one that an instruction-following LLM (with the whole corpus in its context) would spit out as a response if it were prompted with the user's query. E.g. if the user asks a question that could be answered with data, a document containing that data would rank highly. And that's exactly the kind of document we'd like "semantic search" to surface.
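You can see the two scoring setups side by side with sentence-transformers (model choices are illustrative; exact scores will vary):

    # Bi-encoder: embed query and documents separately, compare with cosine similarity
    # ("does the document look like the question?").
    # Cross-encoder: score each (query, document) pair jointly
    # ("does the document look like an answer to the question?").
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    query = "What is the refund window for annual plans?"
    restates = "What's the refund window on annual plans?"                # restates the query
    answers = "Annual plans can be refunded within 30 days of purchase."  # answers it

    bi = SentenceTransformer("all-MiniLM-L6-v2")
    q, d1, d2 = bi.encode([query, restates, answers])
    print(util.cos_sim(q, d1), util.cos_sim(q, d2))  # the restatement usually scores higher here

    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    print(ce.predict([(query, restates), (query, answers)]))  # relevance-trained, so the answer tends to do better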
I've been thinking about the problem of what to do if the answer to a question is very different from the question itself in embedding space. The KB method sounds interesting and not something I'd thought about; you sort of work on the "document side", I guess. I've also heard of HyDE, which works on the query side: you generate hypothetical answers to the user query and look for documents that are similar to the answer, if I've understood it correctly.
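As I understand it, the sketch would be something like this (llm, embed, and vector_search are placeholders for whatever stack you use):

    # HyDE-style retrieval sketch: generate a hypothetical answer to the query, embed
    # that instead of the raw query, and search for documents near the answer.
    def hyde_search(query, llm, embed, vector_search, top_k=10):
        hypothetical = llm(
            "Write a short passage that plausibly answers this question, as if it "
            "came from our documentation. It does not need to be factually correct:\n"
            + query
        )
        return vector_search(embed(hypothetical), top_k=top_k)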
The main point didn't get hit on by the responses. Re-ranking is just a mini-LLM (for latency/cost reasons) that does a double check. The embedding model finds the closest M documents in R^N space; the re-ranker picks the top K documents from those M. In theory, if we just used Gemini 2.5 Pro or GPT-5 as the re-ranker, the performance would be even better than with whatever small re-ranker people choose to use.
The reranker is a cross-encoder that sees the docs and the query at the same time. What you normally do is generate embeddings ahead of time, independent of the prompt, calculate cosine similarity with the prompt, select the top-k chunks that best match it, and only then use a reranker to sort them.
Embeddings are a lossy compression, so if the model sees the chunks and the prompt at the same time, the results are better. But you can't do this for your whole DB; that's why you filter with cosine similarity at the beginning.
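Putting the two stages together, a minimal sketch with sentence-transformers (model choices are just illustrative):

    # Stage 1: cheap bi-encoder embeddings, computed ahead of time for the whole corpus.
    # Stage 2: cross-encoder rerank of the top-k survivors, seeing query and chunk together.
    import numpy as np
    from sentence_transformers import SentenceTransformer, CrossEncoder

    bi = SentenceTransformer("all-MiniLM-L6-v2")
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    corpus = ["chunk one ...", "chunk two ...", "chunk three ..."]
    corpus_emb = bi.encode(corpus, normalize_embeddings=True)  # precomputed, prompt-independent

    def retrieve(query, top_k=50, final_k=5):
        q = bi.encode([query], normalize_embeddings=True)[0]
        sims = corpus_emb @ q  # cosine similarity, since the vectors are unit length
        candidates = [corpus[i] for i in np.argsort(-sims)[:top_k]]
        scores = ce.predict([(query, c) for c in candidates])
        return [candidates[i] for i in np.argsort(-scores)[:final_k]]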