I had spectacular results from AWS Textract recently - which, when this article was written (2019), wasn't yet openly available.
I fed it thousands of pages of historical scanned documents - including handwritten journals from the 1800s - and it could read them better than I could!
I built a tool to use it (since running it in bulk against PDFs in a bucket took a few too many steps) and wrote about my experiences with it here: https://simonwillison.net/2022/Jun/30/s3-ocr/
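For reference, the asynchronous Textract flow for PDFs sitting in S3 looks roughly like this - a minimal sketch, with placeholder bucket/file names and pagination omitted (s3-ocr wraps these steps up for you):

```python
import time
import boto3

textract = boto3.client("textract", region_name="us-east-1")

# Kick off an async text-detection job for a PDF in S3 (names are placeholders).
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-scans", "Name": "journal-1850.pdf"}}
)
job_id = job["JobId"]

# Poll until the job finishes (a real script should also follow NextToken pages).
while True:
    result = textract.get_document_text_detection(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Collect the detected lines of text.
lines = [b["Text"] for b in result.get("Blocks", []) if b["BlockType"] == "LINE"]
print("\n".join(lines))
```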
For comparison, I recently ran an OCR evaluation for some work for a professor. To set some context: all documents were 1960s-era typed or handwritten documents in English, specifically from this archive - http://allenarchive.iac.gatech.edu/. I hand-transcribed 50 documents to use as a baseline and ran them through the various OCR engines, getting the results below.
Leven is Levenshtein distance. Overall is a weighted average of typed vs handwritten, 90/10 if I recall correctly. All results were run on my personal machine with a 5950X, 128 GB RAM, and an RTX 3080.
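To make the scoring concrete, here is a minimal sketch of how a Levenshtein-based character accuracy and the 90/10 weighted average could be computed - the file names are hypothetical and this is just one reasonable way to do it:

```python
from pathlib import Path

import Levenshtein  # pip install python-Levenshtein

def char_accuracy(reference: str, ocr_output: str) -> float:
    """1.0 is a perfect match; lower means more character-level edits needed."""
    dist = Levenshtein.distance(reference, ocr_output)
    return 1.0 - dist / max(len(reference), 1)

# Hypothetical file names: one hand-transcribed reference per OCR output.
typed = char_accuracy(Path("typed_ref.txt").read_text(), Path("typed_ocr.txt").read_text())
hand = char_accuracy(Path("hand_ref.txt").read_text(), Path("hand_ocr.txt").read_text())

# 90/10 typed-vs-handwritten weighting, as described above.
overall = 0.9 * typed + 0.1 * hand
print(f"typed={typed:.3f} handwritten={hand:.3f} overall={overall:.3f}")
```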
From my analysis, Amazon Textract was excellent - the best of all the paid options - and while TrOCR and PaddleOCR were the best FOSS ones, the issue with them is that they require a GPU, whereas Tesseract I could run on CPU alone - for instance, when OCRing all 50 documents.
TrOCR is great if you only need to do a few documents or have GPUs to burn, but Tesseract is by far the better choice if you need "good enough" across a large volume of documents. For my project, the intent was to make a software plugin that could be sent to libraries/universities, so CPU is king.
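As an illustration of the CPU-only workflow, batch-running Tesseract over a folder of page images is only a few lines with pytesseract - a sketch, with an assumed directory layout:

```python
from pathlib import Path

import pytesseract       # pip install pytesseract (plus the tesseract binary itself)
from PIL import Image    # pip install pillow

# Hypothetical layout: page images in ./scans, plain-text output in ./text.
out_dir = Path("text")
out_dir.mkdir(exist_ok=True)

for page in sorted(Path("scans").glob("*.png")):
    text = pytesseract.image_to_string(Image.open(page), lang="eng")
    (out_dir / f"{page.stem}.txt").write_text(text, encoding="utf-8")
```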
I mean to, for sure, but the project isn't done yet - there's some NLP work we're doing with the results of the OCR, and I'd really rather do a full series going over everything we've done than publish one post and then another two months later.
I second the first user's sentiment. Things will keep changing forever - better solutions come around, and existing solutions keep improving - so I would already love to read a blog post on the current state of your research.
The iOS / Apple OCR Swift API is drastically better than the ones I’ve tried online (e.g. Microsoft) or the open source ones (Tesseract). Highly recommended. You can get fairly high throughput with M1 chips. The CNN is accelerated by the neural chip and the language model is accelerated by the GPU.
I'm not sure if it is related, but I noticed recently while taking a photo of some really poorly lit text that the iPhone camera managed to pick it up and enhance it into legibility. Impressive feature and nice attention to detail on their part.
I went looking for a similar comparison a few months ago, and saw this: https://research.aimultiple.com/ocr-accuracy/
It compared ABBYY FineReader 15, Amazon Textract, Google Cloud Platform Vision API, Microsoft Azure Computer Vision API, and the Tesseract OCR Engine. I ended up using OCRmyPDF / Tesseract out of convenience, but doing a second pass with Google Cloud Vision, AWS Textract, or ABBYY is somewhere on my to-do list.
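For that second-pass idea, sending a page image to Cloud Vision's document text detection is roughly this much code - a sketch assuming google-cloud-vision is installed, credentials are configured, and the file name is a placeholder:

```python
from google.cloud import vision  # pip install google-cloud-vision

client = vision.ImageAnnotatorClient()

# Hypothetical page image; credentials come from GOOGLE_APPLICATION_CREDENTIALS.
with open("page-017.png", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection is tuned for dense text such as scanned pages.
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```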
Several years ago, we did a project attempting to develop methods to OCR bilingual dictionaries. We just used Tesseract, because we were trying to develop methods to put stuff into particular fields (headword, part of speech, translations etc.), not compare OCR methods. As you might guess, there were lots of problems. But what really surprised me was that it was completely inaccurate in detecting bold characters--whereas I could detect bolding while standing far enough away from an image that I couldn't make out individual characters. And bold detection was crucial for parsing out some of the fields. (A more recent version of Tesseract doesn't even try to detect bold, afaict.)
We had another project later on aimed at simply detecting bold text, with some success. But there is very little literature on this topic. Does anyone know of OCR tools that do detect bolding?
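Not a tool recommendation, but since bold text reads as "heavier" even from a distance, one crude heuristic is to compare ink density per word box - a purely illustrative sketch, assuming you already have word bounding boxes from the OCR step:

```python
import numpy as np
from PIL import Image

def ink_density(page: Image.Image, box: tuple[int, int, int, int]) -> float:
    """Fraction of dark pixels inside a word bounding box (left, top, right, bottom)."""
    word = np.array(page.convert("L").crop(box))
    return float((word < 128).mean())

# Hypothetical usage: words whose density sits well above the page median
# are likely bold. The 1.3 threshold would need tuning per scan and font.
# densities = [ink_density(page, box) for box in word_boxes]
# bold_flags = [d > 1.3 * np.median(densities) for d in densities]
```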
FB Research uses it, the London Stock Exchange uses it, Chegg uses it (in fact, it even recently transitioned to Mathpix OCR from Google Vision), and many, many other companies and individuals do too.
How about foreign languages? I've never found one good enough for Arabic. Three years ago, when I needed it for a project, no OCR I found could read a properly scanned Arabic page. I had to go on Fiverr and pay a transcriber instead.
I used ABBYY FineReader around 8 years ago to OCR an old EE textbook, and I was really impressed with the results back then. I hadn't heard any mention of the company since then until now, so it's interesting to see that they still seem to have some of the best available OCR tech. I've since tried to use Tesseract for small OCR jobs several times over the last few years, and have never found its results to be even remotely usable (which is a real shame).
What I really want is something with the same kind of convenient APIs and CLIs as ocrmypdf [1] that supports some of the more recent ML-based systems. Ocrmypdf has really good ergonomics for me in terms of ease of scripting.
Something like DocTR [2] with the same API would be fantastic.
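For what it's worth, ocrmypdf also exposes a small Python API, so scripting it looks roughly like this (file names are placeholders; deskew and language are optional knobs):

```python
import ocrmypdf  # pip install ocrmypdf (requires the tesseract binary)

# Add a searchable text layer to a scanned PDF.
ocrmypdf.ocr("scan.pdf", "scan-ocr.pdf", language="eng", deskew=True)
```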
What do folks think about these document types as a corpus for comparing tools? It's missing images and handwriting samples, but those types of documents might just be too variable to make conclusions about.
I remember Baidu's OCR giving excellent English results, but it looks like their API is deprecated now. Out of curiosity, I ran these samples through easyOCR by JaidedAI. Results at https://pastebin.com/RjzVd5Sf.
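For reference, running easyOCR over an image takes only a few lines - a sketch, with a placeholder file name and gpu=False to keep it CPU-only:

```python
import easyocr  # pip install easyocr

reader = easyocr.Reader(["en"], gpu=False)             # models download on first run
lines = reader.readtext("sample-letter.png", detail=0)  # detail=0 -> plain strings
print("\n".join(lines))
```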
I OCR books, so these are not a good sample for me. I would want to compare at least 10 pages per sample, with more typical problems such as skewed or rounded pages from photos, artifacts, damaged source pages (tears and creases), etc. They do reproduce some problems with changing fonts and layout, but a big piece of the puzzle is custom dictionaries and layout training. It's fine for a once-over, but not a deep dive.
What are your favorites overall for book scanning? I'm building a DIY scanner and have only briefly considered the actual software I'll use in the processing pipeline. FOSS or API tooling preferred unless proprietary packages are significantly better.
ABBYY has been much better than anything else due to its ability to fine-tune layout, recognition, and export parameters. I follow threads like this, always looking for the latest and greatest, but nothing else is worth the time for a smaller organization. We scan several languages. If you are English-only, I can't speak to recognition quality since I don't look at that. But layout and export are mission critical, and worth a few hundred bucks if you can afford it.
Thanks for the response. Which ABBYY product(s) is this? I'm a little confused by their website - it seems like they offer quite a lot of combinations of things.
Found this comparison while researching OCR. It doesn't have the latest libraries like PaddleOCR but the performance of different OCR libraries is still quite apparent.
I think among the easy-to-use FOSS CLI tools, the competition is between Tesseract and PaddlePaddle. I'd be curious to know how they fare against each other. I'm mainly interested in using them in `ocrmypdf`.
If you prefer not to install it at this stage, PaddlePaddle may have a web demo - you could try it and compare the results against a document also processed with Tesseract (or against the documents in the article, if the scans are available).
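If you do end up installing it locally, a quick side-by-side against Tesseract is only a few lines - a sketch with a placeholder file name; the PaddleOCR result format has shifted between versions, so treat the indexing as approximate:

```python
import pytesseract
from PIL import Image
from paddleocr import PaddleOCR  # pip install paddleocr paddlepaddle

image_path = "sample-page.png"  # hypothetical scan

# Tesseract (CPU)
print("--- tesseract ---")
print(pytesseract.image_to_string(Image.open(image_path), lang="eng"))

# PaddleOCR (CPU if only the plain paddlepaddle package is installed)
print("--- paddleocr ---")
ocr = PaddleOCR(lang="en")
result = ocr.ocr(image_path)
for line in result[0]:            # each entry: [bbox, (text, confidence)]
    print(line[1][0])
```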
Interesting report. As far as I understand, overall none of the systems was really better in all categories (or did I miss something?). A summary would have been helpful. It would also be interesting to know whether the neural-network-based engine or the traditional Tesseract engine was used. I did similar experiments for a project six years ago and ended up with Tesseract and a custom traineddata file.
Does anyone use OCR to convert Blu-ray subtitles (.sup) to plaintext .srt files? I've used tools like SupRip and BDSup2Sub, but they've all required pretty significant cleanup afterwards - 'l', '1', and 'I' especially get mixed up a lot.
Assuming all these subtitles (at least per movie) are in the same font, isn’t it enough to tell/correct it once that “this is an I” and “this is a 1”, and then it knows for the entire .sup?
That's what I figured, but in practice I wasn't able to do that without having to manually enter most subtitles. The tools are relatively old, so I was hoping something newer had figured it out
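In the meantime, a lot of the 'l'/'1'/'I' cleanup can be automated with a few context-sensitive substitutions - a purely illustrative sketch, not something the subtitle tools themselves do:

```python
import re

def fix_glyph_confusions(line: str) -> str:
    # A standalone lowercase 'l' is almost always the pronoun "I".
    line = re.sub(r"\bl\b", "I", line)
    # A '1' between letters is usually an 'l' (e.g. "vi1lage" -> "village").
    line = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", line)
    # An uppercase 'I' between lowercase letters is usually an 'l' (e.g. "heIp" -> "help").
    line = re.sub(r"(?<=[a-z])I(?=[a-z])", "l", line)
    return line

print(fix_glyph_confusions("l think the vi1lage needs heIp"))
# -> "I think the village needs help"
```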