I had spectacular results from AWS Textract recently - which, when this article was written (2019), wasn't yet openly available.
I fed it thousands of pages of historical scanned documents - including handwritten journals from the 1800s - and it could read them better than I could!
I built a tool to use it (since running it in bulk against PDFs in a bucket took a few too many steps) and wrote about my experiences with it here: https://simonwillison.net/2022/Jun/30/s3-ocr/
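For reference, the asynchronous Textract flow for PDFs sitting in S3 looks roughly like this - a minimal sketch, with placeholder bucket/file names and pagination omitted (s3-ocr wraps these steps up for you):

```python
import time
import boto3

textract = boto3.client("textract", region_name="us-east-1")

# Kick off an async text-detection job for a PDF in S3 (names are placeholders).
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-scans", "Name": "journal-1850.pdf"}}
)
job_id = job["JobId"]

# Poll until the job finishes (a real script should also follow NextToken pages).
while True:
    result = textract.get_document_text_detection(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Collect the detected lines of text.
lines = [b["Text"] for b in result.get("Blocks", []) if b["BlockType"] == "LINE"]
print("\n".join(lines))
```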
For comparison, I recently ran an OCR evaluation for some work for a professor. To set some context: all documents were 1960s-era typed or handwritten documents in English, specifically from this archive - http://allenarchive.iac.gatech.edu/. I hand-transcribed 50 documents to use as a baseline and ran them through the various OCR engines, getting the results below.
Leven is Levenshtein distance. Overall is a weighted average of typed vs handwritten, 90/10 if I recall correctly. All results were run on my personal machine with a 5950X, 128 GB RAM, and an RTX 3080.
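To make the scoring concrete, here is a minimal sketch of how a Levenshtein-based character accuracy and the 90/10 weighted average could be computed - the file names are hypothetical and this is just one reasonable way to do it:

```python
from pathlib import Path

import Levenshtein  # pip install python-Levenshtein

def char_accuracy(reference: str, ocr_output: str) -> float:
    """1.0 is a perfect match; lower means more character-level edits needed."""
    dist = Levenshtein.distance(reference, ocr_output)
    return 1.0 - dist / max(len(reference), 1)

# Hypothetical file names: one hand-transcribed reference per OCR output.
typed = char_accuracy(Path("typed_ref.txt").read_text(), Path("typed_ocr.txt").read_text())
hand = char_accuracy(Path("hand_ref.txt").read_text(), Path("hand_ocr.txt").read_text())

# 90/10 typed-vs-handwritten weighting, as described above.
overall = 0.9 * typed + 0.1 * hand
print(f"typed={typed:.3f} handwritten={hand:.3f} overall={overall:.3f}")
```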
From my analysis, Amazon Textract was excellent - the best of all the paid options - and while TrOCR and PaddleOCR were the best FOSS ones, the issue with them is that they require a GPU, whereas Tesseract I could run on CPU alone - for instance, when OCRing all 50 documents.
TrOCR is great if you only need to do a few documents or have GPUs to burn, but Tesseract is by far the better choice if you need "good enough" across a large volume of documents. For my project, the intent was to make a software plugin that could be sent to libraries/universities, so CPU is king.
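As an illustration of the CPU-only workflow, batch-running Tesseract over a folder of page images is only a few lines with pytesseract - a sketch, with an assumed directory layout:

```python
from pathlib import Path

import pytesseract       # pip install pytesseract (plus the tesseract binary itself)
from PIL import Image    # pip install pillow

# Hypothetical layout: page images in ./scans, plain-text output in ./text.
out_dir = Path("text")
out_dir.mkdir(exist_ok=True)

for page in sorted(Path("scans").glob("*.png")):
    text = pytesseract.image_to_string(Image.open(page), lang="eng")
    (out_dir / f"{page.stem}.txt").write_text(text, encoding="utf-8")
```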
I mean to, for sure, but the project isn't done yet - there's some NLP work we're doing with the results of the OCR, and I'd really rather do a full series going over everything we've done than publish one post and then another two months later.
I second the first user's sentiment. Things will keep changing forever - better solutions come around, and existing solutions keep improving - so I would already love to read a blog post on the current state of your research.
The iOS / Apple OCR Swift API is drastically better than the ones I’ve tried online (e.g. Microsoft) or the open source ones (Tesseract). Highly recommended. You can get fairly high throughput with M1 chips. The CNN is accelerated by the neural chip and the language model is accelerated by the GPU.
I'm not sure if it is related, but I noticed recently while taking a photo of some really poorly lit text that the iPhone camera managed to pick it up and enhance it into legibility. Impressive feature and nice attention to detail on their part.
I went looking for a similar comparison a few months ago, and saw this: https://research.aimultiple.com/ocr-accuracy/
It compared ABBYY FineReader 15, Amazon Textract, Google Cloud Platform Vision API, Microsoft Azure Computer Vision API, and the Tesseract OCR Engine. I ended up using OCRmyPDF / Tesseract out of convenience, but doing a second pass with Google Cloud Vision, AWS Textract, or ABBYY is somewhere on my to-do list.
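For that second-pass idea, sending a page image to Cloud Vision's document text detection is roughly this much code - a sketch assuming google-cloud-vision is installed, credentials are configured, and the file name is a placeholder:

```python
from google.cloud import vision  # pip install google-cloud-vision

client = vision.ImageAnnotatorClient()

# Hypothetical page image; credentials come from GOOGLE_APPLICATION_CREDENTIALS.
with open("page-017.png", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection is tuned for dense text such as scanned pages.
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```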
Several years ago, we did a project attempting to develop methods to OCR bilingual dictionaries. We just used Tesseract, because we were trying to develop methods to put stuff into particular fields (headword, part of speech, translations etc.), not compare OCR methods. As you might guess, there were lots of problems. But what really surprised me was that it was completely inaccurate in detecting bold characters--whereas I could detect bolding while standing far enough away from an image that I couldn't make out individual characters. And bold detection was crucial for parsing out some of the fields. (A more recent version of Tesseract doesn't even try to detect bold, afaict.)
We had another project later on aimed at simply detecting bold text, with some success. But there is very little literature on this topic. Does anyone know of OCR tools that do detect bolding?
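Not a tool recommendation, but since bold text reads as "heavier" even from a distance, one crude heuristic is to compare ink density per word box - a purely illustrative sketch, assuming you already have word bounding boxes from the OCR step:

```python
import numpy as np
from PIL import Image

def ink_density(page: Image.Image, box: tuple[int, int, int, int]) -> float:
    """Fraction of dark pixels inside a word bounding box (left, top, right, bottom)."""
    word = np.array(page.convert("L").crop(box))
    return float((word < 128).mean())

# Hypothetical usage: words whose density sits well above the page median
# are likely bold. The 1.3 threshold would need tuning per scan and font.
# densities = [ink_density(page, box) for box in word_boxes]
# bold_flags = [d > 1.3 * np.median(densities) for d in densities]
```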
FB Research uses it, the London Stock Exchange uses it, Chegg uses it (in fact, it even recently transitioned to Mathpix OCR from Google Vision), and many, many other companies and individuals do too.
How about foreign languages? I've never found one good enough for Arabic. Three years ago, when I needed it for a project, no OCR I found could read a properly scanned Arabic page. I had to go on Fiverr and pay a transcriber instead.
I used ABBYY FineReader around 8 years ago to OCR an old EE textbook, and I was really impressed with the results back then. I hadn't heard any mention of the company since then until now, so it's interesting to see that they still seem to have some of the best available OCR tech. I've since tried to use Tesseract for small OCR jobs several times over the last few years, and have never found its results to be even remotely usable (which is a real shame).
What I really want is something with the same kind of convenient APIs and CLIs as ocrmypdf [1] that supports some of the more recent ML-based systems. Ocrmypdf has really good ergonomics for me in terms of ease of scripting.
Something like DocTR [2] with the same API would be fantastic.
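For what it's worth, ocrmypdf also exposes a small Python API, so scripting it looks roughly like this (file names are placeholders; deskew and language are optional knobs):

```python
import ocrmypdf  # pip install ocrmypdf (requires the tesseract binary)

# Add a searchable text layer to a scanned PDF.
ocrmypdf.ocr("scan.pdf", "scan-ocr.pdf", language="eng", deskew=True)
```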
What do folks think about these document types as a corpus for comparing tools? It's missing images and handwriting samples, but those types of documents might just be too variable to make conclusions about.
I remember Baidu's OCR giving excellent English results, but it looks like their API is deprecated now. Out of curiosity, I ran these samples through easyOCR by JaidedAI. Results at https://pastebin.com/RjzVd5Sf.
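For reference, running easyOCR over an image takes only a few lines - a sketch, with a placeholder file name and gpu=False to keep it CPU-only:

```python
import easyocr  # pip install easyocr

reader = easyocr.Reader(["en"], gpu=False)             # models download on first run
lines = reader.readtext("sample-letter.png", detail=0)  # detail=0 -> plain strings
print("\n".join(lines))
```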
I OCR books, so these are not a good sample for me. I would want to compare at least 10 pages per sample, with more typical problems such as skewed or rounded pages from photos, artifacts, damaged source pages (tears and creases), etc. They do reproduce some problems with changing fonts and layout, but a big piece of the puzzle is custom dictionaries and layout training. It's fine for a once-over, but not a deep dive.
What are your favorites overall for book scanning? I'm building a DIY scanner and have only briefly considered the actual software I'll use in the processing pipeline. FOSS or API tooling preferred unless proprietary packages are significantly better.
ABBYY has been much better than anything else due to its ability to fine-tune layout, recognition, and export parameters. I follow threads like this, always looking for the latest and greatest, but nothing else is worth the time for a smaller organization. We scan several languages. If you are English-only, I can't speak to recognition quality since I don't look at that. But layout and export are mission critical, and worth a few hundred bucks if you can afford it.
Thanks for the response. Which ABBYY product(s) is this? I'm a little confused by their website - it seems like they offer quite a lot of combinations of things.
Found this comparison while researching OCR. It doesn't have the latest libraries like PaddleOCR but the performance of different OCR libraries is still quite apparent.
I think among the easy-to-use FOSS CLI tools, the competition is between Tesseract and PaddlePaddle. I'd be curious to know how they fare against each other. I'm mainly interested in using them in `ocrmypdf`.
If you prefer not to install it at this stage, PaddlePaddle may have a web demo - you could try it and compare the results against a document also processed with Tesseract (or against the documents in the article, if the scans are available).
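If you do end up installing it locally, a quick side-by-side against Tesseract is only a few lines - a sketch with a placeholder file name; the PaddleOCR result format has shifted between versions, so treat the indexing as approximate:

```python
import pytesseract
from PIL import Image
from paddleocr import PaddleOCR  # pip install paddleocr paddlepaddle

image_path = "sample-page.png"  # hypothetical scan

# Tesseract (CPU)
print("--- tesseract ---")
print(pytesseract.image_to_string(Image.open(image_path), lang="eng"))

# PaddleOCR (CPU if only the plain paddlepaddle package is installed)
print("--- paddleocr ---")
ocr = PaddleOCR(lang="en")
result = ocr.ocr(image_path)
for line in result[0]:            # each entry: [bbox, (text, confidence)]
    print(line[1][0])
```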
Interesting report. As far as I understand, overall none of the systems was really better in all categories (or did I miss something?). A summary would have been helpful. It would also be interesting to know whether the neural-network-based engine or the traditional Tesseract engine was used. I did similar experiments for a project six years ago and ended up with Tesseract and a custom traineddata file.
Does anyone use OCR to convert Blu-ray subtitles (.sup) to plaintext .srt files? I've used tools like SupRip and BDSup2Sub, but they've all required pretty significant cleanup afterwards - 'l', '1', and 'I' especially get mixed up a lot.
Assuming all these subtitles (at least per movie) are in the same font, isn’t it enough to tell/correct it once that “this is an I” and “this is a 1”, and then it knows for the entire .sup?
That's what I figured, but in practice I wasn't able to do that without having to manually enter most subtitles. The tools are relatively old, so I was hoping something newer had figured it out
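In the meantime, a lot of the 'l'/'1'/'I' cleanup can be automated with a few context-sensitive substitutions - a purely illustrative sketch, not something the subtitle tools themselves do:

```python
import re

def fix_glyph_confusions(line: str) -> str:
    # A standalone lowercase 'l' is almost always the pronoun "I".
    line = re.sub(r"\bl\b", "I", line)
    # A '1' between letters is usually an 'l' (e.g. "vi1lage" -> "village").
    line = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", line)
    # An uppercase 'I' between lowercase letters is usually an 'l' (e.g. "heIp" -> "help").
    line = re.sub(r"(?<=[a-z])I(?=[a-z])", "l", line)
    return line

print(fix_glyph_confusions("l think the vi1lage needs heIp"))
# -> "I think the village needs help"
```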