Hacker News | carschno's comments

Technically not OCR, but HTR (handwritten text recognition) is still difficult. LLMs have increased accuracy, but their mistakes are very hard to identify because they just 'hallucinate' text they cannot digitize.


This. I am reading old vital records in my family genealogy quest, and as those are sometimes really difficult to read, I turned to LLMs, hearing they are great at OCR. It’s been… terrible. The LLM will transcribe the record without problems, and the output seems completely correct, a typical text of a vital record. Just… the transcribed text has nothing to do with my specific record. On the other hand, transkribus.eu has been fairly usable for old vital record transcription – even though the transcribed text is far from perfect (many letters and words are recognized incorrectly), it helps me a lot with the more difficult records.


We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree that instead of saying "Sorry, I can't read that" it just made up something.


I have a thought that whilst LLM providers could have the models say "Sorry", there is little incentive to do so, and it would expose the reality that they are not very accurate, nor can their accuracy be properly measured. That said, there clearly are use cases where, if the LLM can't reach a certain level of confidence, it should refer the question back to the user rather than guessing.


This is actively being worked on by pretty much every major provider. It was the subject of that recent OpenAI paper on hallucinations. It's mostly caused by benchmarks that reward correct answers but don't penalize bad answers more than simply not answering.

E.g.

Most current benchmarks have a scoring scheme of

1 - Correct answer
0 - No answer or incorrect answer

But what they need is something more like

1 - Correct answer
0.25 - No answer
0 - Incorrect answer

You need benchmarks (particularly those used in training) to incentivize the models to acknowledge when they're uncertain.
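
For illustration, a toy scoring function along those lines (entirely hypothetical, not taken from any real benchmark) might look like:

    from typing import Optional

    # Toy benchmark scoring: abstaining ("I don't know") is worth more than a
    # wrong answer, so guessing only pays off when the model expects to be
    # right more than 25% of the time.
    def score(prediction: Optional[str], truth: str) -> float:
        if prediction is None:                          # the model declined to answer
            return 0.25
        if prediction.strip().lower() == truth.strip().lower():
            return 1.0                                  # correct answer
        return 0.0                                      # confident but wrong

    print(score("Paris", "Paris"), score(None, "Paris"), score("Lyon", "Paris"))
    # 1.0 0.25 0.0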


Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?

I don't have a use case where 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots and it has worked really well.


Often, the review LLM will also say everything is okay when it’s made up too.


I think the actual question was: why would a mail provider develop their own email client?


If it was about an email client, they wouldn’t mention Electron. Read between the lines.


> I am having a really hard time communicating this problem to executives

When you hit such a wall, you might not be failing to communicate, nor are they failing to understand. In reality, said executives have probably chosen to ignore the issue, but also don't want to take accountability for the eventual leaks. So "not understanding" is the easiest way to blame the engineers later.


It doesn't even need to be about blaming the engineers in this case: they can blame "the AI" and most people will accept that and let whatever incident happened slide. If somebody questions the wisdom of putting AI in such a position, they can be dismissed as not appreciating new technology (even though their concern is valid).


Yeah, it is usually not about blaming the engineers in my experience. It is so they can make a decision they want to make without having to think too hard or take any accountability. If nobody knew at the time it was bad, everyone can just act surprised, call it an accident, and go on with their lives making similar uninformed decisions.

In their dream world the engineers would not know about it either.

Edit: Maybe we should call this style vibe management. :D


"the AI did it" is going to be the new "somebody hacked my facebook account"

I wish I had a way of ensuring culpability remains with the human who published the text, regardless of who/what authored it.


if you're in a regulated field like law or medicine and you fuck up signing some AI slop with your name, you should lose your license at the very least

tools are fine to use, but personal responsibility is still required. Companies already fuck up with this too much


I think it needs to be a cultural expectation. I don't know how we get there, though.


Yep. AI is wonderful for IP laundering and accountability laundering (is this even a term? It is now!)


worse, they can be dismissed as an abstract "AI is dangerous" concern and used to justify funnelling money to the various AI safety charlatans


In this case it looks like the executives should fire the OP and hire the 2nd poster who came up with a solution. C'mon lazy executives.


Nice explanations! A (more advanced) aspect which I find missing would be the difference between encoder-decoder transformer models (BERT) and "decoder-only" generative models, with respect to the embeddings.


Minor correction, BERT is an encoder (not encoder-decoder), ChatGPT is a decoder.

Encoders like BERT produce better results for embeddings because they look at the whole sentence, while GPTs look from left to right:

Imagine you're trying to understand the meaning of a word in a sentence, and you can read the entire sentence before deciding what that word means. For example, in "The bank was steep and muddy," you can see "steep and muddy" at the end, which tells you "bank" means the side of a river (aka riverbank), not a financial institution. BERT works this way - it looks at all the words around a target word (both before and after) to understand its meaning.

Now imagine you have to understand each word as you read from left to right, but you're not allowed to peek ahead. So when you encounter "The bank was..." you have to decide what "bank" means based only on "The" - you can't see the helpful clues that come later. GPT models work this way because they're designed to generate text one word at a time, predicting what comes next based only on what they've seen so far.
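
As a rough sketch of the "bank" example, assuming the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint (my choice of model and second sentence, not something from the article): the contextual embedding of "bank" comes out quite different in the two sentences, because BERT can attend to the words on both sides.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def bank_embedding(sentence):
        inputs = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
        # position of the "bank" token in this sentence
        idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
        return hidden[idx]

    steep = bank_embedding("The bank was steep and muddy.")
    loan = bank_embedding("The bank approved my loan.")
    print(torch.cosine_similarity(steep, loan, dim=0))  # well below 1.0: different senses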

Here is a link, also from Hugging Face, about ModernBERT, which has more info: https://huggingface.co/blog/modernbert

Also worth a look: neoBERT https://huggingface.co/papers/2502.19587


As an extreme example that can (intentionally) confuse even human readers, see https://en.wikipedia.org/wiki/Garden-path_sentence


Complete LLM internals noob here: Wouldn't this make GPTs awful at languages like German with separable word prefixes?

E.g. Er macht das Fenster. vs Er macht das Fenster auf.

(He makes the window. vs He opens the window.)


Or exceptionally good at German because they have to keep better track of what is meant and anticipate more?

No I don't think it makes any noticeable difference :)


I'm probably way too English brained :D


Further to @dust42, BERT is an encoder, GPT is a decoder, and T5 is an encoder-decoder.

Encoder-decoders are not in vogue.

Encoders are favored for classification, extraction (eg, NER and extractive QA) and information retrieval.

Decoders are favored for text generation, summarization and translation.

Recent research (see, eg, the Ettin paper: https://arxiv.org/html/2507.11412v1 ) seems to confirm the previous understanding that encoders are indeed better for "encoder tasks" and vice versa.

Fundamentally, both are transformers and so an encoder could be turned into a decoder or a decoder could be turned into an encoder.

The design difference comes down to bidirectional (ie, all tokens can attend to all other tokens) versus autoregressive attention (ie, the current token can only attend to the previous tokens).
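
To make that concrete, here is a toy sketch (plain PyTorch, five tokens, no real model) of the two attention masks:

    import torch

    seq_len = 5
    # Encoder (bidirectional): every token may attend to every other token.
    bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Decoder (autoregressive): token i may only attend to tokens 0..i.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    print(causal_mask.int())
    # tensor([[1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0],
    #         [1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 1]])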


You can use an encoder-style architecture with decoder-style output heads on top for denoising-diffusion-style mask/blank filling. These seem to be somewhat more expensive on short sequences than GPT-style decoder-only models when you batch them: you need fewer passes over the content, and until sequence length blows up your KV-cache throughput cost, fewer passes are cheaper. But for situations that don't get request batching, or where the context length is so heavy that you'd prefer to exploit memory locality in the attention computation, you'd benefit from diffusion-mode decoding.

A nice side effect of the diffusion mode is that its natural reliance on the bidirectional attention from the encoder layers provides much more flexible (and, critically, context-aware) understanding. So, as mentioned, later words can easily modulate earlier words, like with "bank [of the river]"/"bank [in the park]"/"bank [got robbed]", or the classic of these days: telling an agent it did something wrong and expecting it to in-context learn from the mistake (in practice, decoder-only models basically just get polluted by that, so you have to rewind the conversation, because the later correction has literally no way of backwards-affecting the problematic tokens).

That said, the recent surge in training "reasoning" models to utilize thinking tokens that often get cut out of further conversation context, all via a reinforcement learning process that's not merely RLHF/preference-conditioning, is actually quite related: discrete denoising diffusion models can be trained with an RL scheme during pre-training, where the training step is given the outcome goal and a masked version as the input query, and is then trained to manage the work done in the individual steps on its own until it eventually produces the outcome goal, crucially without prescribing any order of filling in the masked tokens or how many to do in which step.

A recent paper on the matter: https://openreview.net/forum?id=MJNywBdSDy
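
As a purely conceptual toy (no model at all, just showing that the unmasking order and per-step counts are free parameters rather than prescribed), diffusion-mode filling proceeds something like:

    import random

    # Toy illustration: start fully masked, reveal an arbitrary subset of
    # positions each step, with every step conditioned on the whole
    # (partially filled) sequence. A real model would predict the tokens;
    # here we just copy them from a target string to show the schedule.
    target = "the bank of the river was steep and muddy".split()
    seq = ["[MASK]"] * len(target)

    while "[MASK]" in seq:
        masked = [i for i, t in enumerate(seq) if t == "[MASK]"]
        for i in random.sample(masked, k=min(3, len(masked))):
            seq[i] = target[i]
        print(" ".join(seq))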


Until we got highly optimized decoder implementations, decoders were often even implemented for prefill by reusing the same implementation as an encoder, but masking the attention logits with a causal mask before the softmax so that tokens could not attend to future tokens.
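
Roughly, that trick looks like this (a toy single-head sketch in plain PyTorch, ignoring batching, multiple heads, and other details):

    import torch
    import torch.nn.functional as F

    seq_len, dim = 4, 8
    q, k, v = (torch.randn(seq_len, dim) for _ in range(3))

    scores = (q @ k.T) / dim ** 0.5                       # same scores an encoder computes
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))   # mask out future positions
    weights = F.softmax(scores, dim=-1)                   # future positions get weight 0
    out = weights @ v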


It's called grassroots marketing. It works particularly well in the context of GenAI because it is fed with esoteric and ideological fragments that overlap with common beliefs and political trends. https://en.wikipedia.org/wiki/TESCREAL

Therefore, classical marketing is less dominant, although it is more present among downstream sellers.


Right. Let's take a bunch of semi-related groups I don't like, and make up an acronym for them so any of my criticism can be applied to some subset of those groups in some form, thus making it seem legitimate and not just a bunch of half-assed strawman arguments.

Also, I guess you're saying I'm a paid shill, or have otherwise been brainwashed by marketing of the vendors, and therefore my positive experiences with LLMs are a lie? :).

I mean, you probably didn't mean that, but part of my point is that you see those positive reports here on HN too, from real people who've been in this community for a while and are not anonymous Internet users - you can't just dismiss that as "grassroots marketing".


> I mean, you probably didn't mean that

Correct, I think you've read too much into it. Grassroots marketing is not a pejorative term, either. Its strategy is indeed to trigger positive reviews of your product, ideally by independent, credible community members.

That implies that those community members have motivations other than being paid. Ideologies and shared beliefs can be some of them. Being happy about the product is a prerequisite, whatever that means for the individual user.



There are remarkable parallels to the Relotius scandal that took place at the German magazine Der Spiegel a few years ago (although in a bigger and more systematic way): https://en.m.wikipedia.org/wiki/Claas_Relotius#Fabrication_o...


Can you recommend some in Amsterdam?


Mailbox looks very solid, although I don't have long-term experience: https://mailbox.org

It provides email, online storage, video conferencing, calendar etc., all of it privacy-preserving by default. You explicitly don't have to provide any personal details.


Seconded. I've been using mailbox.org for my business for 4 years now, and haven't had any problems so far.


You are looking at it from a product perspective. From a scientific perspective, it just means the respective benchmark is meaningless, so we don't know how well such a model generalizes.


Not so! From a scientific perspective the result you can achieve matters, no one is a blank slate.

For humans this is true as well. The way you teach matters. Look at how the bell curve got absolutely demolished for example when math was taught this way:

https://archive.nytimes.com/opinionator.blogs.nytimes.com/20...


Another way to look at this is: the first assembler was hand-coded in binary to begin with, and then that assembler was rewritten in the more expressive language (assembly). Similarly for Fortran/C/etc., bootstrapped from assembly code. Progressively, more expressive languages have been bootstrapped from prior lower-level languages. In a similar way, perhaps a more concise LLM can be built by utilizing a less efficient one?

